AIO v2.0
Hi,
It's been quite a while since the last version of the AIO patchset that I
have posted. Of course parts of the larger project have since gone
upstream [1].
A lot of the time since the last version was spent understanding the
performance characteristics of using AIO for WAL, and chasing some other odd
performance behavior that I couldn't explain. I think I mostly understand
that now, and what the design implications for an AIO subsystem are.
The prototype I had been working on unfortunately suffered from a few design
issues that weren't trivial to fix.
The biggest was that each backend could essentially have hard references to
unbounded numbers of "AIO handles" and that these references prevented these
handles from being reused. Because "AIO handles" have to live in shared
memory (so that other backends can wait on them, IO workers can perform
them, etc.), that's obviously an issue. There was always a way to just run out of AIO
handles. I went through quite a few iterations of a design for how to resolve
that - I think I finally got there.
Another significant issue was that when I wrote the AIO prototype,
bufmgr.c/smgr.c/md.c only issued IOs in BLCKSZ increments, with the AIO
subsystem merging them into larger IOs. Thomas et al's work on streaming
reads makes bufmgr.c issue larger IOs - which is good for performance, but
was surprisingly hard to fit into my older design.
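To illustrate the shape of that change (a toy sketch, not the actual
bufmgr.c/smgr.c code - the function name and signature are made up): instead
of handing BLCKSZ-sized IOs to the AIO subsystem and merging them there, the
caller now computes up front how many consecutive blocks can be combined into
one larger IO:

```c
#include <assert.h>

/*
 * Hypothetical sketch: starting at blocknums[0], count how many of the
 * requested blocks are physically consecutive and can therefore be
 * combined into a single larger IO, capped by a combine limit.
 */
static int
combinable_blocks(const unsigned *blocknums, int nblocks, int combine_limit)
{
	int			n = 1;

	while (n < nblocks && n < combine_limit &&
		   blocknums[n] == blocknums[0] + n)
		n++;
	return n;
}
```

With this approach the issuer decides the IO size before submission, which
is what the rewritten patchset has to accommodate.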
It took me much longer than I had hoped to address these issues in the
prototype. In the end I made progress by rewriting the patchset from
scratch (well, with a bit of copy & paste).
The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.
While making v2 somewhat presentable I unfortunately found a few more design
issues - they're now mostly resolved, I think. But I only resolved the last
one a few hours ago; who knows what a few nights of sleeping on it will
bring. Unfortunately, that prevented me from doing some of the polishing I
had wanted to finish...
Because of the aforementioned move [2], I currently do not have access to my
workstation. I just have access to my laptop - which has enough thermal issues
to make benchmarks not particularly reliable.
So here are just a few teaser numbers, on a PCIe v4 NVMe SSD. Note however
that this is with the BAS_BULKREAD ring size increased - with the default
256kB, we can only keep one IO in flight at a time (because io_combine_limit
builds larger IOs). We'll need to do something better there, but that's yet
another separate discussion.
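As a back-of-the-envelope illustration (assuming the default BLCKSZ of 8kB
and the default io_combine_limit of 16 blocks, i.e. 128kB): a 256kB
BAS_BULKREAD ring only has room for two fully combined IOs, and while the
buffers of one are pinned and being consumed, only one other IO can be in
flight:

```c
#include <assert.h>

/* Back-of-the-envelope numbers, assuming default build options. */
#define BLCKSZ				8192
#define BAS_BULKREAD_SIZE	(256 * 1024)	/* default bulk-read ring */
#define IO_COMBINE_LIMIT	(16 * BLCKSZ)	/* default io_combine_limit */

/* How many fully combined IOs fit into the ring at once? */
static int
ios_in_ring(int ring_size, int combined_io_size)
{
	return ring_size / combined_io_size;
}
```

With only two slots, effectively one IO is in flight while the previous one
is consumed - hence the need to increase the ring size for these numbers.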
Workload: pg_prewarm('pgbench_accounts') of a scale 5k database, which is
bigger than memory:
                             time
master:                      59.097
aio v2.0, worker:            11.211
aio v2.0, uring *:           19.991
aio v2.0, direct, worker:    09.617
aio v2.0, direct, uring *:   09.802
Workload: SELECT sum(abalance) FROM pgbench_accounts;
                            0 workers   1 worker   2 workers   4 workers
master:                        65.753     33.246      21.095      12.918
aio v2.0, worker:              21.519     12.636      10.450      10.004
aio v2.0, uring *:             31.446     17.745      12.889      10.395
aio v2.0, uring **:            23.497     13.824      10.881      10.589
aio v2.0, direct, worker:      22.377     11.989      09.915      09.772
aio v2.0, direct, uring *:     24.502     12.603      10.058      09.759
* the reason io_uring is slower here is that worker mode effectively
  parallelizes the memcpys, at the cost of increased CPU usage
** a simple heuristic to use IOSQE_ASYNC to force some parallelism of memcpys
Workload: checkpointing ~20GB of dirty data, mostly sequential:
                             time
master:                      10.209
aio v2.0, worker:            05.391
aio v2.0, uring:             04.593
aio v2.0, direct, worker:    07.745
aio v2.0, direct, uring:     03.351
To solve the issue of an unbounded number of AIO handle references, there
are a few changes compared to the prior approach:
1) Only one AIO handle can be "handed out" to a backend at a time, without
being defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - one part of why
it was possible to run out of AIO handles.
2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around; it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward progress
guarantee, by ensuring that completing AIOs can free up handles (previously
they could not be reused until the backend-local reference was released).
3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.
4) Obviously some code needs to know the result of an AIO operation and be
able to error out. To allow for that, the issuer of an AIO can provide a pointer
to local memory that'll receive the result of an AIO, including details
about what kind of errors occurred (possible errors are e.g. a read failing
or a buffer's checksum validation failing).
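Roughly, the handle lifecycle those rules imply can be modeled like this (a
toy sketch, not the patchset's actual API - all names, types, and the pool
size are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_AIO_HANDLES 8		/* size of the shared handle pool (made up) */

typedef enum { AH_FREE, AH_HANDED_OUT, AH_DEFINED } AioHandleState;

typedef struct AioHandle
{
	AioHandleState state;
	int		   *result;			/* issuer-provided result destination */
} AioHandle;

static AioHandle pool[NUM_AIO_HANDLES];
static AioHandle *handed_out;	/* rule 1: at most one undefined handle */

static AioHandle *
aio_get_handle(void)
{
	assert(handed_out == NULL);	/* only one handle handed out at a time */
	for (int i = 0; i < NUM_AIO_HANDLES; i++)
	{
		if (pool[i].state == AH_FREE)
		{
			pool[i].state = AH_HANDED_OUT;
			handed_out = &pool[i];
			return handed_out;
		}
	}
	return NULL;				/* real code would execute a defined IO */
}

/* Rule 4: the issuer says where the result should be reported. */
static void
aio_define(AioHandle *h, int *result)
{
	h->state = AH_DEFINED;
	h->result = result;
	handed_out = NULL;			/* backend no longer pins this handle */
}

/* Rule 2: a defined handle can always be executed and then reused. */
static void
aio_execute(AioHandle *h)
{
	*h->result = 0;				/* 0 = success; errors reported here too */
	h->state = AH_FREE;
	h->result = NULL;
}
```

The key point is that once a handle is defined, completing it both reports
the result through the issuer's pointer and immediately frees the handle for
reuse.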
In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).
Besides that, I am planning to introduce "io_method=sync", which will just
execute IO synchronously. In addition to being a good capability to have,
it'll also make it more sensible to split off worker mode support into its
own commit(s).
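Very roughly sketched (names invented for illustration, not the patchset's
actual code), the appeal of such a mode is that the submission path stays the
same for all methods, with the sync method simply performing the IO
immediately in the issuing backend:

```c
#include <assert.h>

typedef enum { IOMETHOD_SYNC, IOMETHOD_WORKER, IOMETHOD_IO_URING } IoMethod;

static int	ios_completed;		/* stand-in for real completion tracking */

static void
io_perform(void)
{
	ios_completed++;			/* would be the actual pread()/pwrite() */
}

static void
io_submit(IoMethod method)
{
	switch (method)
	{
		case IOMETHOD_SYNC:
			/* no queue, no workers: execute right here, synchronously */
			io_perform();
			break;
		case IOMETHOD_WORKER:
		case IOMETHOD_IO_URING:
			/* would hand the IO off to an IO worker / io_uring and return */
			break;
	}
}
```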
Greetings,
Andres Freund
[1]: bulk relation extension, streaming read
[2]: personal health challenges, family health challenges and now moving from the US West Coast to the East Coast, ...
Attachments:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
From e05cf468cab4003baa510053ff921063ca32c19a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Jul 2023 18:59:25 -0700
Subject: [PATCH v2.0 01/17] bufmgr: Return early in
ScheduleBufferTagForWriteback() if fsync=off
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5cdd2f10fc8..ec957635f2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5926,7 +5926,12 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context
{
PendingWriteback *pending;
- if (io_direct_flags & IO_DIRECT_DATA)
+ /*
+ * As pg_flush_data() doesn't do anything with fsync disabled, there's no
+ * point in tracking in that case.
+ */
+ if (io_direct_flags & IO_DIRECT_DATA ||
+ !enableFsync)
return;
/*
--
2.45.2.827.g557ae147e6
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
From a1f0fd69a34d146294bd4398bd5a5712cdc002ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.0 02/17] Allow lwlocks to be unowned
This is required for AIO, so that a lock held during a write can be released
in another backend. That in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 96 +++++++++++++++++++++----------
2 files changed, 68 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..00e8022fbad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockReleaseOwnership(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d805..f3d3435b1f5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,58 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * XXX: this doesn't do a RESUME_INTERRUPTS(), responsibility of the caller.
+ */
+LWLockMode
+LWLockReleaseOwnership(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockReleaseOwnership(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.827.g557ae147e6
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch
From 97e621ddc5fb3b7f60b8dd5517c45fac16e1f6f7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Aug 2021 12:16:28 -0700
Subject: [PATCH v2.0 03/17] Use aux process resource owner in walsender
AIO will need a resource owner to do IO. Right now we create a resowner
on-demand during basebackup, and we could do the same for AIO. But it seems
easier to just always create an aux process resowner.
---
src/include/replication/walsender.h | 1 -
src/backend/backup/basebackup.c | 8 ++++--
src/backend/replication/walsender.c | 44 ++++++-----------------------
3 files changed, 13 insertions(+), 40 deletions(-)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index f2d8297f016..aff0f7a51ca 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -38,7 +38,6 @@ extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *cmd_string);
extern void WalSndErrorCleanup(void);
-extern void WalSndResourceCleanup(bool isCommit);
extern void PhysicalWakeupLogicalWalSnd(void);
extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
extern void WalSndSignals(void);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index de16afac749..23bf8bf2db0 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -250,8 +250,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
state.bytes_total_is_valid = false;
/* we're going to use a BufFile, so we need a ResourceOwner */
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
backup_started_in_recovery = RecoveryInProgress();
@@ -672,7 +674,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
FreeBackupManifest(&manifest);
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
basebackup_progress_done();
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c5f1009f370..0e847535a64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -282,10 +282,8 @@ InitWalSender(void)
/* Create a per-walsender data structure in shared memory */
InitWalSenderSlot();
- /*
- * We don't currently need any ResourceOwner in a walsender process, but
- * if we did, we could call CreateAuxProcessResourceOwner here.
- */
+ /* need resource owner for e.g. basebackups */
+ CreateAuxProcessResourceOwner();
/*
* Let postmaster know that we're a WAL sender. Once we've declared us as
@@ -346,7 +344,7 @@ WalSndErrorCleanup(void)
* without a transaction, we've got to clean that up now.
*/
if (!IsTransactionOrTransactionBlock())
- WalSndResourceCleanup(false);
+ ReleaseAuxProcessResources(false);
if (got_STOPPING || got_SIGUSR2)
proc_exit(0);
@@ -355,34 +353,6 @@ WalSndErrorCleanup(void)
WalSndSetState(WALSNDSTATE_STARTUP);
}
-/*
- * Clean up any ResourceOwner we created.
- */
-void
-WalSndResourceCleanup(bool isCommit)
-{
- ResourceOwner resowner;
-
- if (CurrentResourceOwner == NULL)
- return;
-
- /*
- * Deleting CurrentResourceOwner is not allowed, so we must save a pointer
- * in a local variable and clear it first.
- */
- resowner = CurrentResourceOwner;
- CurrentResourceOwner = NULL;
-
- /* Now we can release resources and delete it. */
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_BEFORE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_AFTER_LOCKS, isCommit, true);
- ResourceOwnerDelete(resowner);
-}
-
/*
* Handle a client's connection abort in an orderly manner.
*/
@@ -685,8 +655,10 @@ UploadManifest(void)
* parsing the manifest will use the cryptohash stuff, which requires a
* resource owner
*/
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
/* Prepare to read manifest data into a temporary context. */
mcxt = AllocSetContextCreate(CurrentMemoryContext,
@@ -723,7 +695,7 @@ UploadManifest(void)
uploaded_manifest_mcxt = mcxt;
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
}
/*
--
2.45.2.827.g557ae147e6
v2.0-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch
From 6e9b170059b75642e348e93e4a83b332ef9b3f99 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 1 Aug 2024 09:56:36 -0700
Subject: [PATCH v2.0 04/17] Ensure a resowner exists for all paths that may
perform AIO
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7637581a184..234fdc57ca7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -331,8 +331,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..be0c7846d00 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3b50ce19a2c..11128ea461c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -719,7 +719,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.827.g557ae147e6
v2.0-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch
From 7d58cc85191c96d8dc731b62810b64c5b366743b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:10:35 -0400
Subject: [PATCH v2.0 05/17] bufmgr/smgr: Don't cross segment boundaries in
StartReadBuffers()
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query which buffers can be
merged.
---
src/include/storage/md.h | 2 ++
src/include/storage/smgr.h | 2 ++
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
src/backend/storage/smgr/md.c | 17 +++++++++++++++++
src/backend/storage/smgr/smgr.c | 16 ++++++++++++++++
5 files changed, 55 insertions(+)
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 620f10abdeb..b72293c79a5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,6 +32,8 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e15b20a566a..899d0d681c5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ec957635f2a..f2e608f597d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1286,6 +1286,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
int actual_nblocks = *nblocks;
int io_buffers_len = 0;
+ int maxcombine = 0;
Assert(*nblocks > 0);
Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
@@ -1317,6 +1318,23 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
/* Extend the readable range to cover this block. */
io_buffers_len++;
+
+ /*
+ * Check how many blocks we can cover with the same IO. The smgr
+ * implementation might e.g. be limited due to a segment boundary.
+ */
+ if (i == 0 && actual_nblocks > 1)
+ {
+ maxcombine = smgrmaxcombine(operation->smgr,
+ operation->forknum,
+ blockNum);
+ if (maxcombine < actual_nblocks)
+ {
+ elog(DEBUG2, "limiting nblocks at %u from %u to %u",
+ blockNum, actual_nblocks, maxcombine);
+ actual_nblocks = maxcombine;
+ }
+ }
}
}
*nblocks = actual_nblocks;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358f..6cd81a61faa 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -803,6 +803,17 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
return iovcnt;
}
+uint32
+mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ BlockNumber segoff;
+
+ segoff = blocknum % ((BlockNumber) RELSEG_SIZE);
+
+ return RELSEG_SIZE - segoff;
+}
+
/*
* mdreadv() -- Read the specified blocks from a relation.
*/
@@ -833,6 +844,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
@@ -956,6 +970,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7b9fa103eff..ee31db85eec 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -88,6 +88,8 @@ typedef struct f_smgr
BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+ uint32 (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
@@ -117,6 +119,7 @@ static const f_smgr smgrsw[] = {
.smgr_extend = mdextend,
.smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
+ .smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
.smgr_writev = mdwritev,
.smgr_writeback = mdwriteback,
@@ -588,6 +591,19 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
}
+/*
+ * smgrmaxcombine() - Return the maximum number of total blocks that can be
+ * combined with an IO starting at blocknum.
+ *
+ * The returned value includes the io for blocknum itself.
+ */
+uint32
+smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+}
+
/*
* smgrreadv() -- read a particular block range from a relation into the
* supplied buffers.
--
2.45.2.827.g557ae147e6
v2.0-0006-aio-Add-liburing-dependency.patch
From 1e1c9d880f71f4548c9616d57d6071e3f90d8f70 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.0 06/17] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/pg_config.h.in | 3 +
src/makefiles/meson.build | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
configure.ac | 11 +++
meson.build | 14 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 979925cc2e2..397133b51ac 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -708,6 +708,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e9275845..cca689b2028 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/configure b/configure
index 537366945c0..317a462f610 100755
--- a/configure
+++ b/configure
@@ -654,6 +654,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -712,6 +714,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -865,6 +868,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -907,6 +911,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1574,6 +1580,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1617,6 +1624,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8664,6 +8675,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13222,6 +13267,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/configure.ac b/configure.ac
index 4e279c4bd66..fa634ecf9e0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -970,6 +970,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1430,6 +1438,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/meson.build b/meson.build
index ea07126f78e..71200f4cb8f 100644
--- a/meson.build
+++ b/meson.build
@@ -848,6 +848,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3103,6 +3115,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3747,6 +3760,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index b9421557606..084eebe72d7 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b49761..a8ff18faed6 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.827.g557ae147e6
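For context, with the build-system patch above applied, liburing support would be enabled roughly like this. These invocations are a hypothetical sketch; only the option names (--with-liburing, the meson 'liburing' feature) come from the configure.ac / meson_options.txt hunks above.

```shell
# Hypothetical build invocations; option names taken from the patch above.

# autoconf: defines USE_LIBURING and checks for liburing via pkg-config
./configure --with-liburing

# meson: 'liburing' is a feature option defaulting to 'auto'; force-enable
# it so a missing liburing fails configuration instead of silently
# disabling io_uring support
meson setup build -Dliburing=enabled
```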
v2.0-0007-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 8b3dabb0ec36a6aea6b5f9d30fadefc8748bfb9c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.0 07/17] aio: Basic subsystem initialization
This is split out into a separate commit to make it easier to review the
tendrils into various places.
---
src/include/storage/aio.h | 42 +++++++++++++++++
src/include/storage/aio_init.h | 26 +++++++++++
src/backend/postmaster/postmaster.c | 8 ++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 35 ++++++++++++++
src/backend/storage/aio/aio_init.c | 46 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/tcop/postgres.c | 7 +++
src/backend/utils/init/miscinit.c | 3 ++
src/backend/utils/init/postinit.c | 3 ++
src/backend/utils/misc/guc_tables.c | 11 +++++
src/backend/utils/misc/postgresql.conf.sample | 7 +++
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 196 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..98fafcf9bc4
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_WORKER = 0,
+ IOMETHOD_IO_URING,
+} IoMethod;
+
+
+/* We'll default to bgworker. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..5bcfb8a9d58
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_postmaster_init(void);
+extern void pgaio_postmaster_child_init_local(void);
+extern void pgaio_postmaster_child_init(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a6fff93db34..921073a2ca4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -111,6 +111,7 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -941,6 +942,13 @@ PostmasterMain(int argc, char *argv[])
ExitPostmaster(0);
}
+ /*
+	 * As AIO might create internal FDs and will trigger shared memory
+	 * allocations, this needs to happen before reset_shared() and
+	 * set_max_safe_fds().
+ */
+ pgaio_postmaster_init();
+
/*
* Set up shared memory and semaphores.
*
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..67f6b52de91
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ *	  Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
+ {NULL, 0, false}
+};
+
+int io_method = IOMETHOD_WORKER;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..1c277a7eb3b
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ *	  Asynchronous I/O subsystem - Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_postmaster_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e6..f0227a12a7d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -39,6 +39,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea1135..4dc46b17b41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -61,6 +61,7 @@
#include "replication/slot.h"
#include "replication/walsender.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -4198,6 +4199,12 @@ PostgresSingleUserMain(int argc, char *argv[],
*/
InitProcess();
+ /* AIO is needed during InitPostgres() */
+ pgaio_postmaster_init();
+ pgaio_postmaster_child_init_local();
+
+ set_max_safe_fds();
+
/*
* Now that sufficient infrastructure has been initialized, PostgresMain()
* can do the rest.
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 537d92c0cfd..b8fa2e64ffe 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -40,6 +40,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "replication/slotsync.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
@@ -137,6 +138,8 @@ InitPostmasterChild(void)
InitProcessLocalLatch();
InitializeLatchWaitSet();
+ pgaio_postmaster_child_init_local();
+
/*
* If possible, make this process a group leader, so that the postmaster
* can signal any child processes too. Not all processes will have
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 11128ea461c..f1151645242 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -589,6 +590,8 @@ BaseInit(void)
*/
pgstat_initialize();
+ pgaio_postmaster_child_init();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 521ec5591c8..4961a5f4b16 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -5196,6 +5197,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a2..e904c3fea30 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -835,6 +835,13 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = worker
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..309686627e7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1257,6 +1257,7 @@ IntervalAggState
IntoClause
InvalMessageArray
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.827.g557ae147e6
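For reference, the io_method GUC added in this patch is PGC_POSTMASTER, so it can only be changed with a server restart. A hypothetical session might look like this (the data directory path is a placeholder; the GUC name and values come from the patch above):

```shell
# "io_uring" is only listed in io_method_options when built with liburing
# support (USE_LIBURING); "worker" is the default.
echo "io_method = worker" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart
psql -c 'SHOW io_method'
```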
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch (text/x-diff)
From 4f6f260ff706c769d5e4f40e5fc23c2c3105afa2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 28 Aug 2024 14:28:36 -0400
Subject: [PATCH v2.0 08/17] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/postmaster.c | 186 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 84 ++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 3 +-
16 files changed, 312 insertions(+), 16 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb9..d043445b544 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -352,6 +352,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -380,6 +381,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 63c12917cfe..4cc000df79e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -62,6 +62,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 5bcfb8a9d58..a38dd982fbe 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -23,4 +23,6 @@ extern void pgaio_postmaster_init(void);
extern void pgaio_postmaster_child_init_local(void);
extern void pgaio_postmaster_child_init(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index deeb06c9e01..b466ba843d6 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -442,7 +442,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 0ae23fdf55e..78429b2af2f 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -55,6 +55,7 @@
#include "replication/walreceiver.h"
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -199,6 +200,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 921073a2ca4..fc3901d5347 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "replication/walsender.h"
#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
@@ -321,6 +322,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead_end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -382,6 +384,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static pid_t io_worker_pids[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -420,6 +426,9 @@ static int CountChildren(int target);
static Backend *assign_backendlist_entry(void);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
+static void signal_io_workers(int signal);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(BackendType type);
static void StartAutovacuumWorker(void);
@@ -1334,6 +1343,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPID == 0)
CheckpointerPID = StartChildProcess(B_CHECKPOINTER);
@@ -1346,7 +1360,6 @@ PostmasterMain(int argc, char *argv[])
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -1995,6 +2008,7 @@ process_pm_reload_request(void)
signal_child(SysLoggerPID, SIGHUP);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, SIGHUP);
+ signal_io_workers(SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -2527,6 +2541,22 @@ process_pm_child_exit(void)
}
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+
+ if (io_worker_count == 0 &&
+ pmState >= PM_SHUTDOWN_IO)
+ {
+ pmState = PM_WAIT_DEAD_END;
+ }
+ continue;
+ }
+
/*
* We don't know anything about this child process. That's highly
* unexpected, as we do track all the child processes that we fork.
@@ -2763,6 +2793,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
if (SlotSyncWorkerPID != 0)
sigquit_child(SlotSyncWorkerPID);
+ /* Take care of io workers too */
+ signal_io_workers(SIGQUIT);
+
/* We do NOT restart the syslogger */
}
@@ -2986,10 +3019,11 @@ PostmasterStateMachine(void)
FatalError = true;
pmState = PM_WAIT_DEAD_END;
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and aio workers too */
SignalChildren(SIGQUIT);
if (PgArchPID != 0)
signal_child(PgArchPID, SIGQUIT);
+ signal_io_workers(SIGQUIT);
}
}
}
@@ -2999,16 +3033,26 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead_end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead_end children and aio workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0)
{
- pmState = PM_WAIT_DEAD_END;
+ pmState = PM_SHUTDOWN_IO;
+ signal_io_workers(SIGUSR2);
}
}
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+	 * PM_SHUTDOWN_IO state ends when there are only dead_end children left.
+ */
+ if (io_worker_count == 0)
+ pmState = PM_WAIT_DEAD_END;
+ }
+
if (pmState == PM_WAIT_DEAD_END)
{
/* Don't allow any new socket connection events. */
@@ -3016,17 +3060,22 @@ PostmasterStateMachine(void)
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
- * (ie, no dead_end children remain), and the archiver is gone too.
+ * (ie, no dead_end children remain), and the archiver and aio workers
+ * are all gone too.
*
- * The reason we wait for those two is to protect them against a new
+ * We need to wait for those because we might have transitioned
+ * directly to PM_WAIT_DEAD_END due to immediate shutdown or fatal
+ * error. Note that they have already been sent appropriate shutdown
+ * signals, either during a normal state transition leading up to
+ * PM_WAIT_DEAD_END, or during FatalError processing.
+ *
+ * The reason we wait for those is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
- * shutdown-and-immediately-restart scenario. Note that they have
- * already been sent appropriate shutdown signals, either during a
- * normal state transition leading up to PM_WAIT_DEAD_END, or during
- * FatalError processing.
+ * shutdown-and-immediately-restart scenario.
*/
- if (dlist_is_empty(&BackendList) && PgArchPID == 0)
+ if (dlist_is_empty(&BackendList) && io_worker_count == 0
+ && PgArchPID == 0)
{
/* These other guys should be dead already */
Assert(StartupPID == 0);
@@ -3119,10 +3168,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3374,6 +3427,7 @@ TerminateChildren(int signal)
signal_child(PgArchPID, signal);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, signal);
+ signal_io_workers(signal);
}
/*
@@ -3955,6 +4009,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4148,6 +4203,109 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == pid)
+ {
+ --io_worker_count;
+ io_worker_pids[id] = 0;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ /* ATODO: This will need to check if io_method == worker */
+
+ /*
+ * If we're in final shutting down state, then we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ int pid;
+ int id;
+
+ /* Find the lowest unused IO worker ID. */
+
+ /*
+ * AFIXME: This logic doesn't work right now, the ids aren't
+ * transported to workers anymore.
+ */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == 0)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Try to launch one. */
+ pid = StartChildProcess(B_IO_WORKER);
+ if (pid > 0)
+ {
+ io_worker_pids[id] = pid;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* Ask the highest used IO worker ID to exit. */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_pids[id] != 0)
+ {
+ kill(io_worker_pids[id], SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+static void
+signal_io_workers(int signal)
+{
+ for (int i = 0; i < MAX_IO_WORKERS; ++i)
+ if (io_worker_pids[i] != 0)
+ signal_child(io_worker_pids[i], signal);
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..824682e7354 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..e13728b73da 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,6 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..5df2eea4a03
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 4dc46b17b41..d42546db195 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3294,6 +3294,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8af55989eed..a750caa9b2a 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -335,6 +335,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
{
case B_INVALID:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..47a2c4d126b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b8fa2e64ffe..bedeed588d3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4961a5f4b16..5670f40478a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3201,6 +3202,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904c3fea30..90430381efa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -839,7 +839,8 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = worker
+#io_method = worker # (change requires restart)
+#io_workers = 3 # 1-32
#------------------------------------------------------------------------------
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0009-aio-Basic-AIO-implementation.patch (text/x-diff)
From 0de554082f3ff6468ff352000774245b337d6d64 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:23:37 -0400
Subject: [PATCH v2.0 09/17] aio: Basic AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- implement "synchronous" AIO method
- split worker, io_uring methods out into separate commits
- lots of cleanup
---
src/include/storage/aio.h | 308 ++++++
src/include/storage/aio_internal.h | 274 +++++
src/include/storage/aio_ref.h | 24 +
src/include/storage/lwlock.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/utils/resowner.h | 7 +
src/backend/access/transam/xact.c | 9 +
src/backend/postmaster/postmaster.c | 3 +-
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 963 +++++++++++++++++-
src/backend/storage/aio/aio_init.c | 318 ++++++
src/backend/storage/aio/aio_io.c | 111 ++
src/backend/storage/aio/aio_subject.c | 170 ++++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_io_uring.c | 393 +++++++
src/backend/storage/aio/method_worker.c | 413 +++++++-
src/backend/storage/lmgr/lwlock.c | 1 +
.../utils/activity/wait_event_names.txt | 4 +
src/backend/utils/misc/guc_tables.c | 25 +
src/backend/utils/misc/postgresql.conf.sample | 6 +
src/backend/utils/resowner/resowner.c | 51 +
src/tools/pgindent/typedefs.list | 23 +
22 files changed, 3104 insertions(+), 7 deletions(-)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 98fafcf9bc4..65052462b02 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,315 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READ,
+ PGAIO_OP_WRITE,
+
+ PGAIO_OP_FSYNC,
+
+ PGAIO_OP_FLUSH_RANGE,
+
+ PGAIO_OP_NOP,
+
+ /**
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_NOP + 1)
+
+
+/*
+ * What is the IO being performed on?
+ *
+ * Subject specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ AHF_REFERENCES_LOCAL = 1 << 0,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be stored in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2), that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_PLACEHOLDER /* empty enums are invalid */ ,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: The FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+
+ struct
+ {
+ int fd;
+ bool datasync;
+ } fsync;
+
+ struct
+ {
+ int fd;
+ uint32 nbytes;
+ uint64 offset;
+ } flush_range;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN,
+ ARS_OK,
+ ARS_PARTIAL,
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ PgAioHandleSharedCallbackID id:8;
+ PgAioResultStatus status:2;
+ uint32 error_data:22;
+ int32 result;
+} PgAioResult;
+
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the code at the lowest level of initiating an
+ * IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
@@ -37,6 +343,8 @@ typedef enum IoMethod
/* GUCs */
extern const struct config_enum_entry io_method_options[];
extern int io_method;
+extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
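[Editor's aside: the comment above PgAioHandleSharedCallbackID explains why callbacks are identified by a small integer ID rather than a function pointer. A minimal sketch of that dispatch scheme, outside the patch: only the ID is stored in (what would be) shared memory, and each process resolves it through a table in its own address space, which is what keeps it safe under EXEC_BACKEND. All names below are illustrative, not part of the patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the callback-ID scheme: the ID (a small
 * integer, cheap to store in a shared struct) is looked up in a
 * process-local table, so differing function addresses across
 * EXEC_BACKEND processes don't matter. */
typedef int (*example_complete_cb) (int raw_result);

static int
example_ok_cb(int raw_result)
{
	/* treat any non-negative raw result as success */
	return raw_result >= 0;
}

/* process-local table, indexed by the ID that lives in shared memory */
static const example_complete_cb example_cb_table[] = {
	NULL,						/* 0: invalid, catches zeroed shared memory */
	example_ok_cb,				/* 1 */
};

static int
example_dispatch(int cb_id, int raw_result)
{
	assert(cb_id > 0 &&
		   cb_id < (int) (sizeof(example_cb_table) / sizeof(example_cb_table[0])));
	return example_cb_table[cb_id] (raw_result);
}
```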
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..67d994cc0b1
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,274 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *    AIO related declarations that should only be used by the AIO subsystem.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subject's prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /*
+ * List of bounce_buffers owned by the IO. It would suffice to use an
+ * index-based linked list here.
+ */
+ slist_head bounce_buffers;
+
+ /**
+ * In which list the handle is registered, depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - not in any list
+ * - REAPED - in per-reap context list
+ * - COMPLETED_SHARED - not in any list
+ * - COMPLETED_LOCAL - not in any list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-io
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be handed out by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we can always acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ dclist_head staged_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ /*
+ * To perform AIO on data that is not located in shared memory, either
+ * because it isn't in shared memory at all, or because we need to operate
+ * on a copy, as e.g. is the case for writes when checksums are in use.
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ void (*postmaster_init) (void);
+ void (*postmaster_child_init_local) (void);
+ void (*postmaster_child_init) (void);
+
+ /* teardown */
+ void (*postmaster_before_child_exit) (void);
+
+ /* handling of IOs */
+ int (*submit) (void);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+
+ /* properties */
+ bool can_scatter_gather_direct;
+ bool can_scatter_gather_buffered;
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronously(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
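[Editor's aside: PGAIO_SUBMIT_BATCH_SIZE above bounds how many staged-but-unsubmitted IOs a backend accumulates; pgaio_io_get_nb() flushes the staged list once that limit is reached. A toy model of the trigger, with plain counters standing in for the real handle lists; the names and the "submit" stand-in are illustrative, not from the patch.]

```c
#include <assert.h>

/* toy stand-in for PGAIO_SUBMIT_BATCH_SIZE */
#define EXAMPLE_BATCH_SIZE 32

static int	example_staged = 0;		/* stands in for dclist_count(&my_aio->staged_ios) */
static int	example_submissions = 0;	/* counts stand-in pgaio_submit_staged() calls */

static void
example_stage_one(void)
{
	/* flush the accumulated batch before staging another IO */
	if (example_staged >= EXAMPLE_BATCH_SIZE)
	{
		example_submissions++;	/* would call pgaio_submit_staged() here */
		example_staged = 0;
	}
	example_staged++;
}
```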
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *    Definition of PgAioHandleRef, which sometimes needs to be used in
+ *    headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
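[Editor's aside: splitting the 64-bit generation into generation_upper/generation_lower presumably keeps PgAioHandleRef free of 8-byte alignment requirements and the padding they would bring. A sketch of packing and reassembling such a split generation; the helper names are illustrative, not from the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* mirrors the layout of PgAioHandleRef: three uint32 fields, no
 * 64-bit member, so the struct only needs 4-byte alignment */
typedef struct ExampleRef
{
	uint32_t	aio_index;
	uint32_t	generation_upper;
	uint32_t	generation_lower;
} ExampleRef;

static void
ref_set_generation(ExampleRef *ref, uint64_t generation)
{
	/* split the 64-bit counter into two 32-bit halves */
	ref->generation_upper = (uint32_t) (generation >> 32);
	ref->generation_lower = (uint32_t) generation;
}

static uint64_t
ref_get_generation(const ExampleRef *ref)
{
	/* reassemble the halves for comparison with the handle's generation */
	return ((uint64_t) ref->generation_upper) << 32 | ref->generation_lower;
}
```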
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 00e8022fbad..f4e6abce327 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd6..7aaccf69d1e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, AioWorkerSubmissionQueue)
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,11 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+
#endif /* RESOWNER_H */
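[Editor's aside: these hooks hand the resource owner only the embedded dlist_node; pgaio_io_release_resowner() later recovers the containing handle with dlist_container(), i.e. offsetof arithmetic. A toy version of that embedded-node pattern with hypothetical types, to show why passing just the node is enough.]

```c
#include <assert.h>
#include <stddef.h>

/* minimal stand-ins for dlist_node and a handle embedding one */
typedef struct toy_node
{
	struct toy_node *prev;
	struct toy_node *next;
} toy_node;

typedef struct toy_handle
{
	int			id;
	toy_node	resowner_node;	/* what the resource owner remembers */
} toy_handle;

/* same trick as dlist_container(): subtract the field offset to get
 * back from the embedded node to the enclosing struct */
#define toy_container(type, field, ptr) \
	((type *) ((char *) (ptr) - offsetof(type, field)))
```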
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0fe1630fca8..cb4ee5dfd1f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -52,6 +52,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2462,6 +2463,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2976,6 +2979,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5350,6 +5357,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fc3901d5347..71930094309 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4221,7 +4221,8 @@ maybe_reap_io_worker(int pid)
static void
maybe_adjust_io_workers(void)
{
- /* ATODO: This will need to check if io_method == worker */
+ if (!pgaio_workers_enabled())
+ return;
/*
* If we're in final shutting down state, then we're just waiting for all
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 824682e7354..2a5e72a8024 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -10,8 +10,11 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
+ aio_io.o \
aio_init.o \
+ aio_subject.o \
method_worker.o \
+ method_io_uring.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 67f6b52de91..d6f9f658b97 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -14,7 +14,23 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -26,10 +42,955 @@ const struct config_enum_entry io_method_options[] = {
{NULL, 0, false}
};
-int io_method = IOMETHOD_WORKER;
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
+
+
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * AFIXME: rewrite
+ *
+ * Shared completion callbacks can be executed by any backend (otherwise there
+ * would be deadlocks). Therefore they cannot update state for the issuer of
+ * the IO. That can be done with issuer callbacks.
+ *
+ * Note that issuer callbacks are effectively executed in a critical
+ * section. This is necessary as we need to be able to execute IO in critical
+ * sections (consider e.g. WAL logging) and to be able to execute IOs we need
+ * to acquire an IO, which in turn requires executing issuer callbacks. An
+ * alternative scheme could be to defer local callback execution until a later
+ * point, but that gets complicated quickly.
+ *
+ * Therefore the typical pattern is to use an issuer callback to set some
+ * flags in backend local memory, which can then be used to error out at a
+ * later time.
+ *
+ * NB: The issuer callback is cleared when the resowner owning the IO goes out
+ * of scope.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all IO handles issued by this backend are in use. Just
+ * wait for some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (dclist_count(&my_aio->staged_ios) >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ioh->state = AHS_HANDED_OUT;
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ ioh->report_return = ret;
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, the memory it's
+ * referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT && state != AHS_REAPED &&
+ state != AHS_COMPLETED_SHARED && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ if (pgaio_impl->wait_one)
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_REAPED && state != AHS_DEFINED &&
+ state != AHS_IN_FLIGHT)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh < (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "idle";
+ case AHS_HANDED_OUT:
+ return "handed_out";
+ case AHS_DEFINED:
+ return "defined";
+ case AHS_PREPARED:
+ return "prepared";
+ case AHS_IN_FLIGHT:
+ return "in_flight";
+ case AHS_REAPED:
+ return "reaped";
+ case AHS_COMPLETED_SHARED:
+ return "completed_shared";
+ case AHS_COMPLETED_LOCAL:
+ return "completed_local";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->state = AHS_DEFINED;
+ ioh->result = 0;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ dclist_push_tail(&my_aio->staged_ios, &ioh->node);
+
+ pgaio_io_prepare_subject(ioh);
+
+ ioh->state = AHS_PREPARED;
+
+ elog(DEBUG3, "io:%d: prepared %s",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh));
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ ioh->result = result;
+
+ pg_write_barrier();
+
+ /* FIXME: should be done in separate function */
+ ioh->state = AHS_REAPED;
+
+ pgaio_io_process_completion_subject(ioh);
+
+ /* ensure results of completion are visible before the new state */
+ pg_write_barrier();
+
+ ioh->state = AHS_COMPLETED_SHARED;
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+/*
+ * Handle IO being processed by IO method.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ ioh->state = AHS_IN_FLIGHT;
+ pg_write_barrier();
+
+ dclist_delete_from(&my_aio->staged_ios, &ioh->node);
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ if (ioh->report_return)
+ {
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ pg_write_barrier();
+ ioh->generation++;
+ pg_write_barrier();
+ ioh->state = AHS_IDLE;
+ pg_write_barrier();
+
+ dclist_push_tail(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ bool found_handed_out = false;
+ int reclaimed = 0;
+ static uint32 lastpos = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ dclist_count(&my_aio->staged_ios));
+
+ /*
+ * First check whether any of our IOs have already completed - when using
+ * worker mode, that will often be the case. We could do this as part of
+ * the loop below, but that could make us wait for IOs submitted earlier.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ if (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+
+ /*
+ * While one might think that pgaio_io_get_nb() should have
+ * succeeded, this is reachable because the IO could have
+ * completed during the submission above.
+ */
+ return;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_HANDED_OUT:
+ if (found_handed_out)
+ elog(ERROR, "more than one handed out IO");
+ found_handed_out = true;
+ continue;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ lastpos = i;
+ return;
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+ lastpos = i;
+ return;
+ }
+ }
+
+ elog(PANIC, "could not reclaim any handles");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME It probably is not correct to have bounce buffers be per backend,
+ * they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+
+ if (dclist_is_empty(&my_aio->staged_ios))
+ return;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ int staged_count PG_USED_FOR_ASSERTS_ONLY = dclist_count(&my_aio->staged_ios);
+ int did_submit;
+
+ Assert(staged_count > 0);
+
+ START_CRIT_SECTION();
+ END_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit();
+
+ total_submitted += did_submit;
+ }
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return !dclist_is_empty(&my_aio->staged_ios);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Staged-but-not-yet-submitted IOs using the fd need to be submitted before
+ * the fd is closed, otherwise the IO would end up targeting something bogus
+ * once the fd number is reused.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 1c277a7eb3b..cf3512f79fc 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,33 +14,351 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee that nothing gets assigned to the ProcNumber of an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * are currently only used for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory, the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * If io_max_concurrency is -1, we automatically choose a suitable value.
+ *
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
+ dclist_init(&bs->idle_ios);
+ dclist_init(&bs->staged_ios);
+ slist_init(&bs->idle_bbs);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_postmaster_init(void)
{
+ if (pgaio_impl->postmaster_init)
+ pgaio_impl->postmaster_init();
}
void
pgaio_postmaster_child_init(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_impl->postmaster_child_init)
+ pgaio_impl->postmaster_child_init();
}
void
pgaio_postmaster_child_init_local(void)
{
+ if (pgaio_impl->postmaster_child_init_local)
+ pgaio_impl->postmaster_child_init_local();
+}
+
+bool
+pgaio_workers_enabled(void)
+{
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..5b2f9ee3ba6
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READ:
+ return "read";
+ case PGAIO_OP_WRITE:
+ return "write";
+ case PGAIO_OP_FSYNC:
+ return "fsync";
+ case PGAIO_OP_FLUSH_RANGE:
+ return "flush_range";
+ case PGAIO_OP_NOP:
+ return "nop";
+ }
+
+ pg_unreachable();
+}
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READ);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITE);
+}
+
+
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITE:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ default:
+ elog(ERROR, "not yet");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..68e9e80074c
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,170 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * IO completion handling for IOs on different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+};
+
+
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ if (aio_shared_cbs[cbid]->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cbid num %d, id %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1, cbid);
+
+ ioh->num_shared_callbacks++;
+}
+
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacks *cbs = aio_shared_cbs[cbid];
+
+ if (!cbs->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d: prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid);
+ cbs->prepare(ioh);
+ }
+}
+
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = 0; /* FIXME */
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid;
+
+ cbid = ioh->shared_callbacks[i - 1];
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = aio_shared_cbs[cbid]->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+bool
+pgaio_io_needs_synchronously(PgAioHandle *ioh)
+{
+ if (aio_subject_info[ioh->subject]->reopen == NULL)
+ return true;
+
+ return false;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ const PgAioHandleSharedCallbacks *scb;
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ scb = aio_shared_cbs[result.id];
+
+ if (scb->error == NULL)
+ elog(ERROR, "scb id %d does not have error callback", result.id);
+
+ scb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index e13728b73da..8960223194a 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -2,7 +2,10 @@
backend_sources += files(
'aio.c',
+ 'aio_io.c',
'aio_init.c',
+ 'aio_subject.c',
+ 'method_io_uring.c',
'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..f76533b4cdc
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,393 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO implementation using io_uring on Linux
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_postmaster_init(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_postmaster_child_init(void);
+static void pgaio_uring_postmaster_child_init_local(void);
+
+static int pgaio_uring_submit(void);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .postmaster_init = pgaio_uring_postmaster_init,
+ .postmaster_child_init = pgaio_uring_postmaster_child_init,
+ .postmaster_child_init_local = pgaio_uring_postmaster_child_init_local,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+#if 0
+ .retry = pgaio_uring_io_retry,
+ .wait_one = pgaio_uring_wait_one,
+ .drain = pgaio_uring_drain,
+#endif
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_postmaster_init(void)
+{
+ uint32 TotalProcs =
+ MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ for (int i = 0; i < TotalProcs; i++)
+ ReserveExternalFD();
+}
+
+static void
+pgaio_uring_postmaster_child_init(void)
+{
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+}
+
+static void
+pgaio_uring_postmaster_child_init_local(void)
+{
+ int ret;
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(void)
+{
+ PgAioHandle *ios[PGAIO_SUBMIT_BATCH_SIZE];
+ struct io_uring_sqe *sqe[PGAIO_SUBMIT_BATCH_SIZE];
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+ int nios = 0;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ dlist_node *node;
+ PgAioHandle *ioh;
+
+ node = dclist_head_node(&my_aio->staged_ios);
+ ioh = dlist_container(PgAioHandle, node, node);
+
+ sqe[nios] = io_uring_get_sqe(uring_instance);
+ ios[nios] = ioh;
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ios[nios], sqe[nios]);
+
+ nios++;
+
+ if (nios == PGAIO_SUBMIT_BATCH_SIZE)
+ break;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", nios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "io_uring_submit() failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != nios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, nios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", nios);
+ }
+ break;
+ }
+
+ return nios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme, nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d in state %s, cycle %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "io_uring_wait_cqes() failed: %d/%s", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITE:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ default:
+ elog(ERROR, "not implemented");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 5df2eea4a03..cd79bf1fba6 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -3,6 +3,21 @@
* method_worker.c
* AIO implementation using workers
*
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken worker can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
+ *
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -16,24 +31,299 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/wait_event.h"
+#include "utils/ps_status.h"
+
+
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+static void pgaio_worker_postmaster_child_init_local(void);
+
+static int pgaio_worker_submit(void);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+ .postmaster_child_init_local = pgaio_worker_postmaster_child_init_local,
+ .submit = pgaio_worker_submit,
+#if 0
+ .wait_one = pgaio_worker_wait_one,
+ .retry = pgaio_worker_io_retry,
+ .drain = pgaio_worker_drain,
+#endif
+
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * pg_nextpower2_32(io_worker_queue_size) +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+static void
+pgaio_worker_postmaster_child_init_local(void)
+{
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "AIO worker submission queue full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static bool
+pgaio_worker_need_synchronous(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || pgaio_io_needs_synchronously(ioh);
+}
+
+static void
+pgaio_worker_submit_internal(PgAioHandle *ios[], int nios)
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ if (pgaio_worker_need_synchronous(ios[i]) ||
+ !pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static int
+pgaio_worker_submit(void)
+{
+ PgAioHandle *ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nios = 0;
+
+ while (!dclist_is_empty(&my_aio->staged_ios))
+ {
+ dlist_node *node;
+ PgAioHandle *ioh;
+
+ node = dclist_head_node(&my_aio->staged_ios);
+ ioh = dlist_container(PgAioHandle, node, node);
+
+ pgaio_io_prepare_submit(ioh);
+
+ Assert(nios < PGAIO_SUBMIT_BATCH_SIZE);
+
+ ios[nios++] = ioh;
+
+ if (nios == PGAIO_SUBMIT_BATCH_SIZE)
+ break;
+ }
+
+ pgaio_worker_submit_internal(ios, nios);
+
+ return nios;
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
/* TODO review all signals */
pqsignal(SIGHUP, SignalHandlerForConfigReload);
@@ -49,7 +339,34 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
- sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* FIXME: locking */
+ MyIoWorkerId = -1;
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ snprintf(cmd, sizeof(cmd), "worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
@@ -64,21 +381,107 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(0);
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f3d3435b1f5..63d1f905554 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 47a2c4d126b..3678f2b3e43 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -192,6 +192,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
@@ -348,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5670f40478a..5828072a48e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3214,6 +3214,31 @@ struct config_int ConfigureNamesInt[] =
NULL, assign_io_workers, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO bounce buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90430381efa..1fc8336496c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -842,6 +842,12 @@
#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,13 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -425,6 +434,9 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
+
return owner;
}
@@ -725,6 +737,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1109,27 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 309686627e7..be8be9fbff0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
@@ -1258,6 +1261,7 @@ IntoClause
InvalMessageArray
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2093,6 +2097,24 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0010-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From 03723ac0d170aba51febc975921296d814af7765 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:33:30 -0400
Subject: [PATCH v2.0 10/17] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 ++
src/include/storage/smgr.h | 21 +++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++++
src/backend/storage/smgr/md.c | 217 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 +++++++++++
8 files changed, 434 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 65052462b02..acfd50c587c 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -57,9 +57,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -90,7 +91,8 @@ typedef enum PgAioHandleFlags
*/
typedef enum PgAioHandleSharedCallbackID
{
- ASC_PLACEHOLDER /* empty enums are invalid */ ,
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -139,6 +141,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 bytes for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,8 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b72293c79a5..ede77695853 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 899d0d681c5..66730bc24fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -109,6 +123,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -126,4 +141,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 68e9e80074c..12ab1730f49 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -28,9 +29,12 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+ [ASC_MD_READV] = &aio_md_readv_cb,
+ [ASC_MD_WRITEV] = &aio_md_writev_cb,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 368cc9455cf..35bf3c1e7bd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -95,6 +95,7 @@
#include "pgstat.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1295,6 +1296,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1988,6 +1991,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2211,6 +2216,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2316,6 +2347,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2499,6 +2558,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2779,6 +2844,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2847,6 +2913,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6cd81a61faa..f96308490d9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -931,6 +932,49 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1036,6 +1080,49 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1357,6 +1444,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1832,3 +1934,118 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+
+
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+};
+
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.id = ASC_MD_READV;
+ result.status = ARS_PARTIAL;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ee31db85eec..2dacb361a4f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -620,6 +642,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * smgrstartreadv() -- asynchronous version of smgrreadv()
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -651,6 +686,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -807,6 +852,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -835,3 +886,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.827.g557ae147e6
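Aside: the mdstartreadv()/mdstartwritev() additions above rely on the existing buffers_to_iovec() helper to turn the array of buffer pointers into as few iovec entries as possible before handing the IO to FileStartReadV()/FileStartWriteV(), which is why they can assert `iovcnt <= nblocks_this_segment`. As a rough illustration of that merging step, here is a self-contained sketch; the name buffers_to_iovec_sketch() and the exact merging rule are assumptions based on how the patch uses the helper, not the actual PostgreSQL implementation.

```c
#include <stddef.h>
#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Sketch of a buffers_to_iovec()-style helper: collapse an array of
 * BLCKSZ-sized buffer pointers into iovec entries, merging buffers that
 * happen to be adjacent in memory so that a single preadv()/pwritev()
 * call can cover them.  Assumed behavior, not the real implementation.
 */
static int
buffers_to_iovec_sketch(struct iovec *iov, void **buffers, int nblocks)
{
	int iovcnt = 0;

	for (int i = 0; i < nblocks; i++)
	{
		char *buf = buffers[i];

		if (iovcnt > 0 &&
			(char *) iov[iovcnt - 1].iov_base + iov[iovcnt - 1].iov_len == buf)
		{
			/* physically contiguous with the previous entry: extend it */
			iov[iovcnt - 1].iov_len += BLCKSZ;
		}
		else
		{
			/* start a new iovec entry */
			iov[iovcnt].iov_base = buf;
			iov[iovcnt].iov_len = BLCKSZ;
			iovcnt++;
		}
	}

	return iovcnt;
}
```

Four buffers carved out of one contiguous allocation collapse into a single iovec entry, while scattered buffers keep one entry each; the returned count is therefore at most nblocks, matching the assertion pattern in the md.c code above.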
Attachment: v2.0-0011-bufmgr-Implement-AIO-support.patch (text/x-diff, us-ascii)
From 9d8c6210e3a5e39d585d0a8ebebeac8a9e9b62a2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2.0 11/17] bufmgr: Implement AIO support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 6 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 10 +
src/backend/storage/aio/aio_subject.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 431 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 ++++
7 files changed, 519 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index acfd50c587c..40c80a2fed4 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -93,6 +93,12 @@ typedef enum PgAioHandleSharedCallbackID
{
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
+
+ ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e46..5cfa7dbd1f1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -252,6 +253,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -465,4 +468,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..6cd64b8c2b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,14 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +202,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 12ab1730f49..0676f3d3a66 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -35,6 +35,11 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
[ASC_MD_READV] = &aio_md_readv_cb,
[ASC_MD_WRITEV] = &aio_md_writev_cb,
+
+ [ASC_SHARED_BUFFER_READ] = &aio_shared_buffer_read_cb,
+ [ASC_SHARED_BUFFER_WRITE] = &aio_shared_buffer_write_cb,
+ [ASC_LOCAL_BUFFER_READ] = &aio_local_buffer_read_cb,
+ [ASC_LOCAL_BUFFER_WRITE] = &aio_local_buffer_write_cb,
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 09bec6449b6..059a80dfb13 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
@@ -126,6 +127,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f2e608f597d..8feafd6e53c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -541,7 +543,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1108,7 +1111,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1593,7 +1596,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2477,7 +2480,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3926,7 +3929,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5541,6 +5544,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5548,10 +5552,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5640,7 +5653,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5652,6 +5665,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5660,6 +5680,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5711,7 +5765,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6170,3 +6224,366 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+static bool
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of the IO is no longer managing the content lock (it
+ * called LWLockReleaseOwnership()); we are, so release it here.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock now owned by IO.
+ */
+ LWLockReleaseOwnership(content_lock);
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_read_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static void
+shared_buffer_write_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
+
+static PgAioResult
+shared_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling ReadBufferCompleteReadShared for buffer %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * AFIXME: It'd probably be better to not set BM_IO_ERROR (which is
+ * what failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+shared_buffer_read_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+static PgAioResult
+shared_buffer_write_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
+static void
+local_buffer_read_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: error handling */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ false);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+local_buffer_write_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb = {
+ .prepare = shared_buffer_read_prepare,
+ .complete = shared_buffer_read_complete,
+ .error = shared_buffer_read_error,
+};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb = {
+ .prepare = shared_buffer_write_prepare,
+ .complete = shared_buffer_write_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb = {
+ .prepare = local_buffer_read_prepare,
+ .complete = local_buffer_read_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb = {
+ .prepare = local_buffer_write_prepare,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8da7dd6c98a..a7eb723f1e9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.827.g557ae147e6

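An aside on the completion callbacks in the patch above: for a combined, multi-buffer read, `shared_buffer_read_complete()` remembers only the first failing buffer, storing its offset plus one in `error_data` (presumably so that zero can mean "no per-buffer failure"). A simplified, self-contained sketch of that encoding - helper names are mine, not the patch's, and the decode step assumes the `+ 1` really is a sentinel offset:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the per-buffer error encoding in the read completion callback:
 * a combined read covers N buffers; if the buffer at offset `off` fails
 * verification, error_data records off + 1, so 0 can mean "no failure".
 * Only the first failure is remembered, matching the
 * result.status != ARS_ERROR check in the patch.
 */
typedef struct
{
	int			status;			/* 0 = ok, 1 = error */
	uint32_t	error_data;		/* failing offset + 1, or 0 */
} SketchResult;

static void
sketch_mark_failed(SketchResult *result, int io_data_off)
{
	if (result->status == 0)	/* first failure wins */
	{
		result->status = 1;
		result->error_data = (uint32_t) io_data_off + 1;
	}
}

/*
 * Recover the failing block from the operation's start block, assuming
 * the + 1 is a "zero means none" sentinel that must be subtracted again.
 */
static uint32_t
sketch_failing_block(uint32_t start_block, const SketchResult *result)
{
	assert(result->error_data > 0);
	return start_block + (result->error_data - 1);
}
```

Note that the error callback in the patch adds `error_data` to `blockNum` without subtracting the sentinel, which would report the block after the failing one; the sketch above shows the decode I would have expected.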
Attachment: v2.0-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From fe6df768de29f124263f6fe250017f04454ca699 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2.0 12/17] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 ++-
src/backend/storage/buffer/bufmgr.c | 259 +++++++++++++++++-----------
2 files changed, 182 insertions(+), 102 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6cd64b8c2b3..a075a40b2ed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,11 +108,22 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
+
struct ReadBuffersOperation
{
/* The following members should be set by the caller. */
@@ -131,6 +143,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8feafd6e53c..90e873d278f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1280,6 +1280,12 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1315,6 +1321,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1352,27 +1364,18 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
- {
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
- }
+ operation->nios = 0;
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /*
+ * TODO: When called for synchronous IO execution, we probably should
+ * enter a dedicated fastpath here.
+ */
+
+ /* initiate the IO */
+ return AsyncReadBuffers(operation,
+ buffers,
+ blockNum,
+ nblocks, flags);
}
/*
@@ -1424,12 +1427,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * AFIXME: localbuf.c should use IO_IN_PROGRESS / have an equivalent
+ * of StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1439,12 +1461,7 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
char persistence;
/*
@@ -1460,11 +1477,65 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
+ persistence = operation->persistence;
+
+ Assert(operation->nios > 0);
+
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret;
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ aio_ret = &operation->returns[i];
+
+ if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out to be not true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
+ */
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgBufferUsage.local_blks_read += nblocks;
+ else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: io timing */
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags)
+{
+ int io_buffers_len = 0;
+ BlockNumber blocknum;
+ ForkNumber forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+
buffers = &operation->buffers[0];
blocknum = operation->blocknum;
forknum = operation->forknum;
- persistence = operation->persistence;
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
@@ -1485,25 +1556,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* but another backend completed the read".
*/
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += nblocks;
+ pgBufferUsage.local_blks_read += *nblocks;
else
- pgBufferUsage.shared_blks_read += nblocks;
+ pgBufferUsage.shared_blks_read += *nblocks;
- for (int i = 0; i < nblocks; ++i)
+ for (int i = 0; i < *nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
+
+ /*
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ */
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
/*
* Skip this block if someone else has already completed it. If an
* I/O is already in progress in another backend, this will wait for
* the outcome: either done, or something went wrong and we will
* retry.
+ *
+ * ATODO: Should we wait if we already submitted another IO?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1515,6 +1594,10 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u", buffers[i]),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1524,6 +1607,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG3,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we can scatter-read into
* other buffers at the same time? In this case we don't wait if we
@@ -1531,86 +1619,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* for the head block, so we should get on with that I/O as soon as
* possible. We'll come back to this block again, above.
*/
- while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ while ((i + 1) < *nblocks &&
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG3,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
{
- BufferDesc *bufHdr;
- Block bufBlock;
-
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
-
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
-
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ pgaio_io_set_flag(ioh, AHF_REFERENCES_LOCAL);
}
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
+
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
}
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
+
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
+ }
+ else
+ return false;
}
/*
--
2.45.2.827.g557ae147e6
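To make the new `AsyncReadBuffers()` loop above easier to follow: it walks the operation's buffers, skips any block another backend already completed, and merges maximal runs of consecutively startable blocks into single IOs. A simplified model of just the grouping logic (the real code is additionally bounded by the arrays sized `MAX_IO_COMBINE_LIMIT` and by the caller; names here are mine):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the IO-combining loop in AsyncReadBuffers(): skip blocks that
 * can't start IO (someone else completed them), merge runs of startable
 * consecutive blocks into one IO each, capped at combine_limit blocks.
 * Returns the number of IOs that would be started.
 */
static int
sketch_count_ios(const bool *can_start, int nblocks, int combine_limit)
{
	int			nios = 0;

	for (int i = 0; i < nblocks; i++)
	{
		int			len;

		if (!can_start[i])
			continue;			/* counted as a "hit" in the patch */

		/* extend the IO while neighboring blocks are startable too */
		len = 1;
		while (i + 1 < nblocks && len < combine_limit && can_start[i + 1])
		{
			i++;
			len++;
		}
		nios++;
	}
	return nios;
}
```

In the patch each such run gets its own AIO handle, the IO data is attached via `pgaio_io_set_io_data_32()`, and `smgrstartreadv()` stages it; staged IOs are then submitted in one `pgaio_submit_staged()` call at the end unless the caller asked to batch further.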
Attachment: v2.0-0013-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (text/x-diff; charset=us-ascii)
From 6fcd84b237df81097a4271198e380dd82c76757b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.0 13/17] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 29 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a075a40b2ed..ac6496bb1eb 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -117,6 +117,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 2)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..42b2434918b 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "catalog/pg_tablespace.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -223,14 +224,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -289,6 +294,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -338,6 +351,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -362,6 +376,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -476,10 +492,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -710,7 +727,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90e873d278f..59f4b22457d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1665,7 +1665,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.827.g557ae147e6
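The throttle added at the top of `read_stream_look_ahead()` above is easy to miss in the diff: once the stream's distance exceeds 8x the combine limit, look-ahead is deferred while the work already in flight (pinned buffers plus the pending read) covers more than 3/4 of the distance. The constants mirror the patch; the standalone predicate is my restatement:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the look-ahead throttle in read_stream_look_ahead(): with a
 * large distance, don't queue yet more reads while most of the distance
 * is already covered by pinned buffers and the pending read.
 */
static bool
sketch_should_defer_lookahead(int distance, int pinned_buffers,
							  int pending_read_nblocks, int io_combine_limit)
{
	return distance > io_combine_limit * 8 &&
		pinned_buffers + pending_read_nblocks > (distance * 3) / 4;
}
```

With the default `io_combine_limit` of 16 (128kB / 8kB blocks), the throttle only engages once the distance exceeds 128 blocks.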
Attachment: v2.0-0014-aio-Add-IO-queue-helper.patch (text/x-diff; charset=us-ascii)
From 2df34d8ac4fa381da607358ad3d214aadd05fdc7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:00:06 -0700
Subject: [PATCH v2.0 14/17] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2a5e72a8024..3fb527ed0d1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_io.o \
aio_init.o \
aio_subject.o \
+ io_queue.o \
method_worker.o \
method_io_uring.o \
read_stream.o
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..4dda2f4e20e
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8960223194a..6d64c75a49c 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_io.c',
'aio_init.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be8be9fbff0..6f39abcdf3c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1171,6 +1171,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2959,6 +2960,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.827.g557ae147e6
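One detail of the IO queue worth calling out: `io_queue_track()` doesn't submit each IO individually. Staged IOs accumulate and are flushed to the kernel in small batches - hardcoded to 4 in the patch, which the XXX comment itself flags as needing smarter logic. A minimal stand-in for that policy, with `submit_calls` counting what would be `pgaio_submit_staged()` invocations (names are mine):

```c
#include <assert.h>

/*
 * Sketch of the batching policy in io_queue_track(): count tracked-but-
 * unsubmitted IOs and flush them as a batch once batch_size accumulate,
 * trading per-IO submission overhead against the risk of blocking on an
 * IO that hasn't been handed to the kernel yet.
 */
typedef struct
{
	int			unsubmitted;
	int			submit_calls;	/* pgaio_submit_staged() in the patch */
	int			batch_size;		/* 4 in the patch */
} SketchQueue;

static void
sketch_track(SketchQueue *q)
{
	q->unsubmitted++;
	if (q->unsubmitted >= q->batch_size)
	{
		q->submit_calls++;
		q->unsubmitted = 0;
	}
}
```

Callers that drain the queue (`io_queue_wait_all()`) would still need a final flush for any partial batch; the patch relies on waiting on individual IO refs for that.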
Attachment: v2.0-0015-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff; charset=us-ascii)
From f7ad1fbd6a37434b67cb50916a5c28255d3a14eb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2.0 15/17] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro rather than the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 1 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 588 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 64 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5cfa7dbd1f1..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ac6496bb1eb..a65888c8915 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -325,7 +325,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 5999e5ca5a5..f5f5adb066d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts()'s remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 199f008bcda..0350a71cab4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -708,7 +711,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -741,6 +744,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 59f4b22457d..e62f2de2034 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -538,8 +540,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -557,6 +557,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -2981,6 +2982,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3012,7 +3063,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3074,7 +3128,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3182,48 +3238,89 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since PrepareToWriteBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * PrepareToWriteBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, PrepareToWriteBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ break;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3241,15 +3338,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3275,7 +3380,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3318,6 +3423,8 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3494,11 +3601,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3509,6 +3630,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == io_combine_limit)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3520,6 +3648,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3558,8 +3691,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3568,22 +3759,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
- int result = 0;
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
uint32 buf_state;
- BufferTag tag;
+ int result = 0;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3593,7 +3818,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3602,40 +3827,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
- /*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
- */
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
-
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
-
- tag = bufHdr->tag;
-
- UnpinBuffer(bufHdr);
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
- return result | BUF_WRITTEN;
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
+
+ /*
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
+ */
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
+
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -4001,6 +4468,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d29..8295e3fb0a0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1491,6 +1491,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A bounce-buffer copy is only needed when checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6f39abcdf3c..ca6dd0bebf0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -344,6 +344,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.827.g557ae147e6
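The core of the batching added to BufferSync()/BgBufferSync() in the patch above is the mergeability test in CanMergeWrite(): a pending write can only absorb the next buffer if it is the block immediately following the ones collected so far, belongs to the same relation fork, and stays within the combine limit obtained from smgrmaxcombine(). Here is a self-contained sketch of that check with simplified stand-in types; it is illustrative only and omits the lazy smgr inquiry and the pin/recheck dance of the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's BufferTag. */
typedef struct BufferTag
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} BufferTag;

/*
 * A pending combined write: it starts at start_tag and currently covers
 * nbuffers consecutive blocks. max_combine is the largest write smgr allows
 * starting at start_tag (via smgrmaxcombine() in the patch).
 */
typedef struct PendingWrite
{
	BufferTag	start_tag;
	int			nbuffers;
	int			max_combine;
} PendingWrite;

static bool
buffer_tags_same_rel(const BufferTag *tag1, const BufferTag *tag2)
{
	return tag1->spcOid == tag2->spcOid &&
		tag1->dbOid == tag2->dbOid &&
		tag1->relNumber == tag2->relNumber &&
		tag1->forkNum == tag2->forkNum;
}

/*
 * Sketch of CanMergeWrite(): the cheap block-number contiguity check runs
 * first, since most candidate buffers are not mergeable; the rel/fork match
 * and the combine limit are checked afterwards.
 */
static bool
can_merge_write(const PendingWrite *w, const BufferTag *next)
{
	if (w->start_tag.blockNum + w->nbuffers != next->blockNum)
		return false;

	if (!buffer_tags_same_rel(&w->start_tag, next))
		return false;

	if (w->start_tag.blockNum + w->max_combine <= next->blockNum)
		return false;

	return true;
}
```

When this check fails, the caller flushes the accumulated batch with WriteBuffers() and retries the same buffer as the start of a fresh batch - which is why BUF_CANT_MERGE can only be returned when at least one buffer is already queued.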
Attachment: v2.0-0016-very-wip-test_aio-module.patch (text/x-diff)
From c3a8731578a7fc1b03609e5bdb800e4fc18db80e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2.0 16/17] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 180 +++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 ++++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 65 +++
src/test/modules/test_aio/sql/inject.sql | 51 ++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/test_aio--1.0.sql | 94 ++++
src/test/modules/test_aio/test_aio.c | 479 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
22 files changed, 1272 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 67d994cc0b1..cd3063f6c11 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -259,6 +259,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_worker_ops;
#ifdef USE_LIBURING
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index d6f9f658b97..9db661b1cd0 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -22,6 +22,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -65,6 +68,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -529,6 +537,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
/* FIXME: should be done in separate function */
ioh->state = AHS_REAPED;
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
/* ensure results of completion are visible before the new state */
@@ -994,3 +1015,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e62f2de2034..f774b42651a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -541,7 +541,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6122,7 +6121,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520a..7df90602e90 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d236..bc7d19e694f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: Unlike the meson build, this does not run the tests once with the
+# worker method and once - if supported - with io_uring.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e52b0f086dd
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,180 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+NOTICE: wrapped error: could not read blocks 1..2 in file base/<redacted>: read only 8192 of 16384 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..102c2e01537
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,65 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..b3d34de8977
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,51 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+SELECT inj_io_short_read_detach();
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce Buffers handles
+----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..ea9ad43ed8f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,94 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..9626d495241
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,479 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for the asynchronous I/O (AIO) subsystem.
+ *
+ * Provides SQL-callable helpers used by the regression tests to exercise
+ * AIO: acquiring and releasing AIO handles and bounce buffers (to verify
+ * ownership tracking and leak warnings), corrupting and invalidating
+ * relation blocks, and issuing reads whose results can be changed from
+ * within injection points, e.g. to simulate short reads and I/O errors
+ * without depending on actual storage failures.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "storage/ipc.h"
+#include "access/relation.h"
+#include "utils/rel.h"
+#include "utils/injection_point.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState *inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+	inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+		/*
+		 * First time through, so initialize and attach the injection point
+		 * used to modify the result of IOs.
+		 */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+	/* FIXME: even if this is just a test, we should verify nobody else uses this buffer */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.827.g557ae147e6
Attachment: v2.0-0017-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 6ad40d5b074c4af85289e48b574c3461dbab9a4c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.0 17/17] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there just aren't enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..5be8125ad3a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,11 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.827.g557ae147e6
On 01/09/2024 09:27, Andres Freund wrote:
The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.
+1 on that approach.
To solve the issue with an unbounded number of AIO references there are a few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend without being
   defined. Previously the process of getting an AIO handle wasn't super
   lightweight, which made it appealing to cache AIO handles - which was one
   part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
   valid operation) to stay around; it's always possible to execute the AIO
   operation and then reuse the handle. This provides a forward progress
   guarantee, by ensuring that completing AIOs can free up handles (previously
   they couldn't be reused until the backend-local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
   take the server down.

4) Obviously some code needs to know the result of an AIO operation and be
   able to error out. To allow for that, the issuer of an AIO can provide a
   pointer to local memory that'll receive the result of the AIO, including
   details about what kind of errors occurred (possible errors are e.g. a
   read failing or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).
Yeah, a high-level README would be nice. Without that, it's hard to
follow what "handed out" and "defined" above means for example.
A few quick comments on the patches:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
+1, this seems ready to be committed right away.
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
With LOCK_DEBUG, LWLock->owner will point to the backend that acquired
the lock, but it doesn't own it anymore. That's reasonable, but maybe
add a boolean to the LWLock to mark whether the lock is currently owned
or not.
The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockRelease(). From the names, you might
think that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch
+1. The old comment "We don't currently need any ResourceOwner in a
walsender process" was a bit misleading, because the walsender did
create the short-lived "base backup" resource owner, so it's nice to get
that fixed.
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch
My refactoring around postmaster.c child process handling will conflict
with this [1]. Not in any fundamental way, but can I ask you to review
those patches, please? After those patches, AIO workers should also have
PMChild slots (formerly known as Backend structs).
[1]: /messages/by-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 2024-09-02 13:03:07 +0300, Heikki Linnakangas wrote:
On 01/09/2024 09:27, Andres Freund wrote:
In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to follow
what "handed out" and "defined" above means, for example.
Yea - I had actually written a bunch of that before, but then redesigns just
obsoleted most of it :(
FWIW, "handed out" is an IO handle acquired by code, which doesn't yet have an
operation associated with it. Once "defined" it actually could be - but isn't
yet - executed.
A few quick comments on the patches:
v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch
+1, this seems ready to be committed right away.
Cool
v2.0-0002-Allow-lwlocks-to-be-unowned.patch
With LOCK_DEBUG, LWLock->owner will point to the backend that acquired the
lock, but it doesn't own it anymore. That's reasonable, but maybe add a
boolean to the LWLock to mark whether the lock is currently owned or not.
Hm, not sure it's worth doing that...
The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockrelease(). From the names, you might think
that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.
Yea, that makes sense.
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch
My refactoring around postmaster.c child process handling will conflict with
this [1]. Not in any fundamental way, but can I ask you to review those
patches, please? After those patches, AIO workers should also have PMChild
slots (formerly known as Backend structs).
I'll try to do that soonish!
Greetings,
Andres Freund
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?
Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.
On Sun, 1 Sept 2024 at 18:28, Andres Freund <andres@anarazel.de> wrote:
0 workers 1 worker 2 workers 4 workers
master: 65.753 33.246 21.095 12.918
aio v2.0, worker: 21.519 12.636 10.450 10.004
aio v2.0, uring*: 31.446 17.745 12.889 10.395
aio v2.0, uring** 23.497 13.824 10.881 10.589
aio v2.0, direct, worker: 22.377 11.989 09.915 09.772
aio v2.0, direct, uring*: 24.502 12.603 10.058 09.759
I took this for a test drive on an AMD 3990x machine with a 1TB
Samsung 980 Pro SSD on PCIe 4. I only tried io_method = io_uring, but
I did try with and without direct IO.
This machine has 64GB RAM and I was using ClickBench Q2 [1], which is
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;"
(for some reason they use 0-based query IDs). This table is 64GBs
without indexes.
I'm seeing direct IO slower than buffered IO with smaller worker
counts. That's counter to what I would have expected, as I'd have
thought the memcpys from kernel space would be quite an overhead in
the buffered IO case. With larger worker counts the bottleneck is
certainly disk. The part that surprised me was that the bottleneck is
reached more quickly with buffered IO. I was seeing iotop going up to
5.54GB/s at higher worker counts.
times in milliseconds
workers buffered direct cmp
0 58880 102852 57%
1 33622 53538 63%
2 24573 40436 61%
4 18557 27359 68%
8 14844 17330 86%
16 12491 12754 98%
32 11802 11956 99%
64 11895 11941 100%
Is there some other information I can provide to help this make sense?
(Or maybe it does already to you.)
David
[1]: https://github.com/ClickHouse/ClickBench/blob/main/postgresql-tuned/queries.sql
Hi,
Attached is the next version of the patchset. Changes:
- added "sync" io method, the main benefit of that is that the main AIO commit
doesn't need to include worker mode
- split worker and io_uring methods into their own commits
- added src/backend/storage/aio/README.md, explaining design constraints and
the resulting design on a high level
- renamed LWLockReleaseOwnership as suggested by Heikki
- a bunch of small cleanups and improvements
There's plenty more to do, but I thought this would be a useful checkpoint.
Greetings,
Andres Freund
Attachments:
Attachment: v2.1-0012-aio-Add-README.md-explaining-higher-level-desig.patch (text/x-diff)
From 55448fdaa5e54983fdfd147ff1f28cf3867d58e5 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 6 Sep 2024 15:27:57 -0400
Subject: [PATCH v2.1 12/20] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 311 ++++++++++++++++++++++++++++++
1 file changed, 311 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..9c3a11f2063
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,311 @@
+# Asynchronous & Direct IO
+
+## Design Criteria & Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, Postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+ buffered IO is bottlenecked by the operating system having to copy data
+ between the kernel's page cache and postgres' buffer pool using the CPU.
+ Direct IO, in contrast, can often move the data directly between the
+ storage device and postgres' buffer pool using DMA. While that transfer is
+ ongoing, the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before they
+ need to be waited on
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+ the number of roundtrips to storage on some OSs and storage HW (buffered IO
+ and direct IO without O_DSYNC need to issue a write and, after the write's
+ completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a
+ single FUA write).
+
+The need to be able to execute IO in critical sections has substantial
+design implications for the AIO subsystem, mainly because completing IOs
+(see the prior section) needs to be possible within a critical section, even
+if the to-be-completed IO itself was not issued in one. Consider e.g. the
+case of a backend first starting a number of writes from shared buffers and
+then starting to flush the WAL. Because only a limited amount of IO can be
+in progress at the same time, initiating the IO for flushing the WAL may
+require first finishing IOs issued earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the AIO subsystem's state needs to
+live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other
+process-local state are not necessarily mapped to the same addresses in each
+process due to ASLR. This means that shared memory cannot contain pointers
+to callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows the AIO API to be
+used while performing synchronous IO. This can be useful for debugging. The
+code for the synchronous mode is also used as a fallback, e.g. the
+[worker mode](#Worker) uses it to execute IO that cannot be executed by
+workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central pieces of the postgres AIO API are AIO handles. To execute an IO
+one first has to acquire an IO handle (`pgaio_io_get()`) and then "define"
+it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c and md.c,
+to be finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#IO-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#State-for-AIO-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code
+needs to be able to react to IO completion. Shared state can be updated
+using
+[AIO Completion callbacks](#AIO-Callbacks)
+and the issuing backend can provide a backend-local variable to receive the
+result of the IO, as described in
+[AIO Results](#AIO-Results).
+An IO can be waited for, by both the issuing and any other backend, using
+[AIO References](#AIO-References).
+
+
+Because an AIO Handle is not executable just after calling `pgaio_io_get()`
+and because `pgaio_io_get()` needs to be able to succeed, only a single AIO
+Handle may be acquired (i.e. returned by `pgaio_io_get()`) without having
+been defined (by, potentially indirectly, causing `pgaio_io_prep_*()` to
+have been called). Otherwise a backend could trivially self-deadlock by
+using up all AIO Handles without the ability to wait for some of the IOs to
+complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to completion of an IO. E.g. for a
+read, md.c needs to check if the IO outright failed or was shorter than
+needed, and bufmgr.c needs to verify that the page looks valid and to update
+the buffer's state in the BufferDesc.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+if the IO operation was successful.
+
+As [mentioned](#State-for-AIO-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleSharedCallbackID`). A substantial added benefit is that this
+allows callbacks to be identified by a much smaller amount of memory (a
+single byte currently).
+
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
+
+As [explained earlier](#IO-can-be-started-in-critical-sections), IO
+completions need to be safe to execute in critical sections. To allow the
+backend that issued the IO to error out in case of failure,
+[AIO Results](#AIO-Results) can be used.
+
+
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject, and can provide callbacks to reopen the
+underlying file (required for worker mode) and to describe the IO operation
+(used for debug logging and error messages).
+
+
+### AIO References
+
+As [described above](#AIO-Handles), AIO Handles can be reused immediately
+after completion and therefore cannot be used to wait for completion of the
+IO. Waiting is enabled using AIO references, which do not just identify an
+AIO Handle but also include the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_ref()` and
+then waited upon using `pgaio_io_ref_wait()`.
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#IO-can-be-started-in-critical-sections)
+and [may be executed by any backend](#Deadlock-and-Starvation-Dangers-due-to-AIO)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to
+a `PgAioReturn` in backend-local memory. Before an AIO Handle is reused, the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
+
+
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an AIO. E.g. when data checksums are enabled, writes
+from shared buffers currently cannot be done directly from shared buffers,
+as a shared buffer lock still allows some modification, e.g., for hint bits
+(see `FlushBuffer()`). If the write were done in place, such modifications
+could cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer many buffers are required, as many IOs might be
+ in flight
+- When using the [worker method](#worker), the source/target of IO needs to be
+ in shared memory, otherwise the workers won't be able to access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target for IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO Handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
+## Helpers
+
+Using the low-level AIO API all over the tree would introduce too much
+complexity. Most uses of AIO should instead be done via reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO is reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+[Read stream](../../include/storage/read_stream.h)
+makes it comparatively easy to use AIO for such use cases.
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0013-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From f138cbab018b104e416d23175a38141d8827232d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:33:30 -0400
Subject: [PATCH v2.1 13/20] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 ++
src/include/storage/smgr.h | 21 +++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++++
src/backend/storage/smgr/md.c | 217 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 +++++++++++
8 files changed, 434 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index b8c743548c9..07bf92a6b7a 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -57,9 +57,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -90,7 +91,8 @@ typedef enum PgAioHandleFlags
*/
typedef enum PgAioHandleSharedCallbackID
{
- ASC_PLACEHOLDER /* empty enums are invalid */ ,
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -139,6 +141,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 byte for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,10 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b72293c79a5..ede77695853 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 899d0d681c5..66730bc24fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -109,6 +123,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -126,4 +141,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 51ee3b3969d..14be8432f5a 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -28,9 +29,12 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+ [ASC_MD_READV] = &aio_md_readv_cb,
+ [ASC_MD_WRITEV] = &aio_md_writev_cb,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ec1505802b9..f5ff554f946 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -95,6 +95,7 @@
#include "pgstat.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1295,6 +1296,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1988,6 +1991,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2211,6 +2216,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2316,6 +2347,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2499,6 +2558,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2779,6 +2844,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2847,6 +2913,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6cd81a61faa..f96308490d9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -931,6 +932,49 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1036,6 +1080,49 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1357,6 +1444,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1832,3 +1934,118 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+
+
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+};
+
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.id = ASC_MD_READV;
+ result.status = ARS_PARTIAL;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ee31db85eec..2dacb361a4f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -620,6 +642,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * FILL ME IN
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -651,6 +686,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -807,6 +852,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -835,3 +886,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0014-bufmgr-Implement-AIO-support.patch (text/x-diff; charset=us-ascii)
From 34a11207d325b445d15a12e2c63aff4b90a935d8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2.1 14/20] bufmgr: Implement AIO support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 6 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 10 +
src/backend/storage/aio/aio_subject.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 432 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 ++++
7 files changed, 520 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 07bf92a6b7a..260c3701247 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -93,6 +93,12 @@ typedef enum PgAioHandleSharedCallbackID
{
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
+
+ ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e46..5cfa7dbd1f1 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -252,6 +253,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -465,4 +468,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..6cd64b8c2b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,14 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +202,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 14be8432f5a..07c7989b273 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -35,6 +35,11 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
[ASC_MD_READV] = &aio_md_readv_cb,
[ASC_MD_WRITEV] = &aio_md_writev_cb,
+
+ [ASC_SHARED_BUFFER_READ] = &aio_shared_buffer_read_cb,
+ [ASC_SHARED_BUFFER_WRITE] = &aio_shared_buffer_write_cb,
+ [ASC_LOCAL_BUFFER_READ] = &aio_local_buffer_read_cb,
+ [ASC_LOCAL_BUFFER_WRITE] = &aio_local_buffer_write_cb,
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 09bec6449b6..059a80dfb13 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proc.h"
@@ -126,6 +127,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7e987836335..976ced82b6a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5514,6 +5517,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5521,10 +5525,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5613,7 +5626,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5625,6 +5638,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5633,6 +5653,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5684,7 +5738,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6143,3 +6197,367 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of IO is not managing the lock (i.e. called
+ * LWLockDisown()), we are.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by IO.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_read_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static void
+shared_buffer_write_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
+
+static PgAioResult
+shared_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * AFIXME: It'd probably be better to not set BM_IO_ERROR (which is
+ * what failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+shared_buffer_read_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+static PgAioResult
+shared_buffer_write_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
+static void
+local_buffer_read_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_read_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: error handling */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ false);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+local_buffer_write_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_read_cb = {
+ .prepare = shared_buffer_read_prepare,
+ .complete = shared_buffer_read_complete,
+ .error = shared_buffer_read_error,
+};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_write_cb = {
+ .prepare = shared_buffer_write_prepare,
+ .complete = shared_buffer_write_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_read_cb = {
+ .prepare = local_buffer_read_prepare,
+ .complete = local_buffer_read_complete,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_write_cb = {
+ .prepare = local_buffer_write_prepare,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8da7dd6c98a..a7eb723f1e9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0015-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From bfd939b88a8dcdbc424c1e7452d70195a46910ae Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2.1 15/20] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 ++-
src/backend/storage/buffer/bufmgr.c | 259 +++++++++++++++++-----------
2 files changed, 182 insertions(+), 102 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6cd64b8c2b3..a075a40b2ed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,11 +108,22 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
+
struct ReadBuffersOperation
{
/* The following members should be set by the caller. */
@@ -131,6 +143,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 976ced82b6a..4914c71d41e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1253,6 +1253,12 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1294,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1325,27 +1337,18 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
- {
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
- }
+ operation->nios = 0;
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /*
+ * TODO: When called for synchronous IO execution, we probably should
+ * enter a dedicated fastpath here.
+ */
+
+ /* initiate the IO */
+ return AsyncReadBuffers(operation,
+ buffers,
+ blockNum,
+ nblocks, flags);
}
/*
@@ -1397,12 +1400,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * AFIXME: localbuf.c should use IO_IN_PROGRESS / have an equivalent
+ * of StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,12 +1434,7 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
char persistence;
/*
@@ -1433,11 +1450,65 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
+ persistence = operation->persistence;
+
+ Assert(operation->nios > 0);
+
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret;
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ aio_ret = &operation->returns[i];
+
+ if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * We count all these blocks as read by this backend. This is traditional
+ * behavior, but might turn out to be not true if we find that someone
+ * else has beaten us and completed the read of some of these blocks. In
+ * that case the system globally double-counts, but we traditionally don't
+ * count this as a "hit", and we don't have a separate counter for "miss,
+ * but another backend completed the read".
+ */
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgBufferUsage.local_blks_read += nblocks;
+ else
+ pgBufferUsage.shared_blks_read += nblocks;
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: io timing */
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ Buffer *buffers,
+ BlockNumber blockNum,
+ int *nblocks,
+ int flags)
+{
+ int io_buffers_len = 0;
+ BlockNumber blocknum;
+ ForkNumber forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+
buffers = &operation->buffers[0];
blocknum = operation->blocknum;
forknum = operation->forknum;
- persistence = operation->persistence;
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
@@ -1458,25 +1529,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* but another backend completed the read".
*/
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += nblocks;
+ pgBufferUsage.local_blks_read += *nblocks;
else
- pgBufferUsage.shared_blks_read += nblocks;
+ pgBufferUsage.shared_blks_read += *nblocks;
- for (int i = 0; i < nblocks; ++i)
+ for (int i = 0; i < *nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
+
+ /*
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ */
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
/*
* Skip this block if someone else has already completed it. If an
* I/O is already in progress in another backend, this will wait for
* the outcome: either done, or something went wrong and we will
* retry.
+ *
+ * ATODO: Should we wait if we already submitted another IO?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1567,10 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u", buffers[i]),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1580,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG3,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into
* other buffers at the same time? In this case we don't wait if we
@@ -1504,86 +1592,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* for the head block, so we should get on with that I/O as soon as
* possible. We'll come back to this block again, above.
*/
- while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ while ((i + 1) < *nblocks &&
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG3,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
{
- BufferDesc *bufHdr;
- Block bufBlock;
-
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
-
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
-
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ pgaio_io_set_flag(ioh, AHF_REFERENCES_LOCAL);
}
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
+
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
}
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
+
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
+ }
+ else
+ return false;
}
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (text/x-diff; charset=us-ascii)
From ea3373e8793932e849d2904046f76b14ec971549 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Jul 2023 18:59:25 -0700
Subject: [PATCH v2.1 01/20] bufmgr: Return early in
ScheduleBufferTagForWriteback() if fsync=off
As pg_flush_data() doesn't do anything with fsync disabled, there's no point
in tracking the buffer for writeback. Arguably the better fix would be to
change pg_flush_data() to flush data even with fsync off, but that's a
behavioral change, whereas this is just a small optimization.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/storage/buffer/bufmgr.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 48520443001..b8680cc8fd4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5899,7 +5899,12 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context
{
PendingWriteback *pending;
- if (io_direct_flags & IO_DIRECT_DATA)
+ /*
+ * As pg_flush_data() doesn't do anything with fsync disabled, there's no
+ * point in tracking in that case.
+ */
+ if (io_direct_flags & IO_DIRECT_DATA ||
+ !enableFsync)
return;
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0002-Allow-lwlocks-to-be-unowned.patch (text/x-diff; charset=us-ascii)
From 7daeafca64fd950bf63fb43cdb31fd578f27c85d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.1 02/20] Allow lwlocks to be unowned
This is required for AIO so that a lock held during a write can be released
in another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 110 ++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..eabf813ce05 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockDisown(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab3..a5fa77412ed 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,72 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure
+ * that the lock gets released, even in case of an error. This is only
+ * desirable if the lock is going to be released in a different process than
+ * the process that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */
+LWLockMode
+LWLockDisown(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisown(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0003-Use-aux-process-resource-owner-in-walsender.patch (text/x-diff; charset=us-ascii)
From 6dacd88481c9c79042a6b5bdc5783ca8f8ce1cce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Aug 2021 12:16:28 -0700
Subject: [PATCH v2.1 03/20] Use aux process resource owner in walsender
AIO will need a resource owner to do IO. Right now we create a resowner
on-demand during basebackup, and we could do the same for AIO. But it seems
easier to just always create an aux process resowner.
---
src/include/replication/walsender.h | 1 -
src/backend/backup/basebackup.c | 8 ++++--
src/backend/replication/walsender.c | 44 ++++++-----------------------
3 files changed, 13 insertions(+), 40 deletions(-)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index f2d8297f016..aff0f7a51ca 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -38,7 +38,6 @@ extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
extern bool exec_replication_command(const char *cmd_string);
extern void WalSndErrorCleanup(void);
-extern void WalSndResourceCleanup(bool isCommit);
extern void PhysicalWakeupLogicalWalSnd(void);
extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
extern void WalSndSignals(void);
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 14e5ba72e97..0f8cddcbeeb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -250,8 +250,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
state.bytes_total_is_valid = false;
/* we're going to use a BufFile, so we need a ResourceOwner */
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
backup_started_in_recovery = RecoveryInProgress();
@@ -672,7 +674,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
FreeBackupManifest(&manifest);
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
basebackup_progress_done();
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c5f1009f370..0e847535a64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -282,10 +282,8 @@ InitWalSender(void)
/* Create a per-walsender data structure in shared memory */
InitWalSenderSlot();
- /*
- * We don't currently need any ResourceOwner in a walsender process, but
- * if we did, we could call CreateAuxProcessResourceOwner here.
- */
+ /* need resource owner for e.g. basebackups */
+ CreateAuxProcessResourceOwner();
/*
* Let postmaster know that we're a WAL sender. Once we've declared us as
@@ -346,7 +344,7 @@ WalSndErrorCleanup(void)
* without a transaction, we've got to clean that up now.
*/
if (!IsTransactionOrTransactionBlock())
- WalSndResourceCleanup(false);
+ ReleaseAuxProcessResources(false);
if (got_STOPPING || got_SIGUSR2)
proc_exit(0);
@@ -355,34 +353,6 @@ WalSndErrorCleanup(void)
WalSndSetState(WALSNDSTATE_STARTUP);
}
-/*
- * Clean up any ResourceOwner we created.
- */
-void
-WalSndResourceCleanup(bool isCommit)
-{
- ResourceOwner resowner;
-
- if (CurrentResourceOwner == NULL)
- return;
-
- /*
- * Deleting CurrentResourceOwner is not allowed, so we must save a pointer
- * in a local variable and clear it first.
- */
- resowner = CurrentResourceOwner;
- CurrentResourceOwner = NULL;
-
- /* Now we can release resources and delete it. */
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_BEFORE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_LOCKS, isCommit, true);
- ResourceOwnerRelease(resowner,
- RESOURCE_RELEASE_AFTER_LOCKS, isCommit, true);
- ResourceOwnerDelete(resowner);
-}
-
/*
* Handle a client's connection abort in an orderly manner.
*/
@@ -685,8 +655,10 @@ UploadManifest(void)
* parsing the manifest will use the cryptohash stuff, which requires a
* resource owner
*/
- Assert(CurrentResourceOwner == NULL);
- CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+ Assert(AuxProcessResourceOwner != NULL);
+ Assert(CurrentResourceOwner == AuxProcessResourceOwner ||
+ CurrentResourceOwner == NULL);
+ CurrentResourceOwner = AuxProcessResourceOwner;
/* Prepare to read manifest data into a temporary context. */
mcxt = AllocSetContextCreate(CurrentMemoryContext,
@@ -723,7 +695,7 @@ UploadManifest(void)
uploaded_manifest_mcxt = mcxt;
/* clean up the resource owner we created */
- WalSndResourceCleanup(true);
+ ReleaseAuxProcessResources(true);
}
/*
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (text/x-diff; charset=us-ascii)
From a70eeb4cc7dd87a693162f0632d5d60bfa17575e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 1 Aug 2024 09:56:36 -0700
Subject: [PATCH v2.1 04/20] Ensure a resowner exists for all paths that may
perform AIO
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7637581a184..234fdc57ca7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -331,8 +331,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3fe1774a1e9..be0c7846d00 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3b50ce19a2c..11128ea461c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -719,7 +719,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (text/x-diff)
From 5308c29e3fd09601ad2e63669837f1e7eef45921 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 22:10:35 -0400
Subject: [PATCH v2.1 05/20] bufmgr/smgr: Don't cross segment boundaries in
StartReadBuffers()
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query how many blocks can
be merged into one IO.
---
src/include/storage/md.h | 2 ++
src/include/storage/smgr.h | 2 ++
src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
src/backend/storage/smgr/md.c | 17 +++++++++++++++++
src/backend/storage/smgr/smgr.c | 16 ++++++++++++++++
5 files changed, 55 insertions(+)
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 620f10abdeb..b72293c79a5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,6 +32,8 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e15b20a566a..899d0d681c5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b8680cc8fd4..7e987836335 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1259,6 +1259,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
int actual_nblocks = *nblocks;
int io_buffers_len = 0;
+ int maxcombine = 0;
Assert(*nblocks > 0);
Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
@@ -1290,6 +1291,23 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
{
/* Extend the readable range to cover this block. */
io_buffers_len++;
+
+ /*
+ * Check how many blocks we can cover with the same IO. The smgr
+ * implementation might e.g. be limited due to a segment boundary.
+ */
+ if (i == 0 && actual_nblocks > 1)
+ {
+ maxcombine = smgrmaxcombine(operation->smgr,
+ operation->forknum,
+ blockNum);
+ if (maxcombine < actual_nblocks)
+ {
+ elog(DEBUG2, "limiting nblocks at %u from %u to %u",
+ blockNum, actual_nblocks, maxcombine);
+ actual_nblocks = maxcombine;
+ }
+ }
}
}
*nblocks = actual_nblocks;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358f..6cd81a61faa 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -803,6 +803,17 @@ buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
return iovcnt;
}
+uint32
+mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ BlockNumber segoff;
+
+ segoff = blocknum % ((BlockNumber) RELSEG_SIZE);
+
+ return RELSEG_SIZE - segoff;
+}
+
/*
* mdreadv() -- Read the specified blocks from a relation.
*/
@@ -833,6 +844,9 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
@@ -956,6 +970,9 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
size_this_segment = nblocks_this_segment * BLCKSZ;
transferred_this_segment = 0;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7b9fa103eff..ee31db85eec 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -88,6 +88,8 @@ typedef struct f_smgr
BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks);
+ uint32 (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
@@ -117,6 +119,7 @@ static const f_smgr smgrsw[] = {
.smgr_extend = mdextend,
.smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
+ .smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
.smgr_writev = mdwritev,
.smgr_writeback = mdwriteback,
@@ -588,6 +591,19 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
}
+/*
+ * smgrmaxcombine() - Return the maximum total number of blocks that can be
+ * combined with an IO starting at blocknum.
+ *
+ * The returned value includes the IO for blocknum itself.
+ */
+uint32
+smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum)
+{
+ return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum);
+}
+
/*
* smgrreadv() -- read a particular block range from a relation into the
* supplied buffers.
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0006-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 177af4d07a51bac7b785dc02b2abea019d7395e4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.1 06/20] aio: Basic subsystem initialization
This is split out as a separate commit to make it easier to review the
tendrils into various places.
---
src/include/storage/aio.h | 41 +++++++++++++++++
src/include/storage/aio_init.h | 26 +++++++++++
src/backend/postmaster/postmaster.c | 8 ++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 32 +++++++++++++
src/backend/storage/aio/aio_init.c | 46 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/tcop/postgres.c | 7 +++
src/backend/utils/init/miscinit.c | 3 ++
src/backend/utils/init/postinit.c | 3 ++
src/backend/utils/misc/guc_tables.c | 11 +++++
src/backend/utils/misc/postgresql.conf.sample | 7 +++
src/tools/pgindent/typedefs.list | 1 +
14 files changed, 192 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..1e4dfd07e89
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..5bcfb8a9d58
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_postmaster_init(void);
+extern void pgaio_postmaster_child_init_local(void);
+extern void pgaio_postmaster_child_init(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 96bc1d1cfed..70c5ce19f6e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -111,6 +111,7 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -941,6 +942,13 @@ PostmasterMain(int argc, char *argv[])
ExitPostmaster(0);
}
+ /*
+ * As AIO might create internal FDs and will trigger shared memory
+ * allocations, we need to do this before reset_shared() and
+ * set_max_safe_fds().
+ */
+ pgaio_postmaster_init();
+
/*
* Set up shared memory and semaphores.
*
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..d831c772960
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+int io_method = DEFAULT_IO_METHOD;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..1c277a7eb3b
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * Asynchronous I/O subsystem - Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_postmaster_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 6caeca3a8e6..f0227a12a7d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -39,6 +39,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -339,6 +341,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
WaitLSNShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea1135..4dc46b17b41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -61,6 +61,7 @@
#include "replication/slot.h"
#include "replication/walsender.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -4198,6 +4199,12 @@ PostgresSingleUserMain(int argc, char *argv[],
*/
InitProcess();
+ /* AIO is needed during InitPostgres() */
+ pgaio_postmaster_init();
+ pgaio_postmaster_child_init_local();
+
+ set_max_safe_fds();
+
/*
* Now that sufficient infrastructure has been initialized, PostgresMain()
* can do the rest.
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 537d92c0cfd..b8fa2e64ffe 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -40,6 +40,7 @@
#include "postmaster/interrupt.h"
#include "postmaster/postmaster.h"
#include "replication/slotsync.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/latch.h"
@@ -137,6 +138,8 @@ InitPostmasterChild(void)
InitProcessLocalLatch();
InitializeLatchWaitSet();
+ pgaio_postmaster_child_init_local();
+
/*
* If possible, make this process a group leader, so that the postmaster
* can signal any child processes too. Not all processes will have
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 11128ea461c..f1151645242 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -589,6 +590,8 @@ BaseInit(void)
*/
pgstat_initialize();
+ pgaio_postmaster_child_init();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58b..a4b3c7c62bd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -5196,6 +5197,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a2..3a5e307c9dc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -835,6 +835,13 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index df3f336bec0..2681dd51bb7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1258,6 +1258,7 @@ IntervalAggState
IntoClause
InvalMessageArray
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0007-aio-Core-AIO-implementation.patch (text/x-diff)
From d1c318432d40aee43b46db6187a033872af96b31 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:23:08 -0400
Subject: [PATCH v2.1 07/20] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 308 ++++++
src/include/storage/aio_internal.h | 274 +++++
src/include/storage/aio_ref.h | 24 +
src/include/utils/resowner.h | 7 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 975 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 304 ++++++
src/backend/storage/aio/aio_io.c | 111 ++
src/backend/storage/aio/aio_subject.c | 167 +++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_sync.c | 43 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/misc/guc_tables.c | 25 +
src/backend/utils/misc/postgresql.conf.sample | 6 +
src/backend/utils/resowner/resowner.c | 51 +
src/tools/pgindent/typedefs.list | 19 +
17 files changed, 2332 insertions(+)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1e4dfd07e89..c0a59f47bc0 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,315 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READ,
+ PGAIO_OP_WRITE,
+
+ PGAIO_OP_FSYNC,
+
+ PGAIO_OP_FLUSH_RANGE,
+
+ PGAIO_OP_NOP,
+
+ /**
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_NOP + 1)
+
+
+/*
+ * What the IO is being performed on.
+ *
+ * Subject-specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ AHF_REFERENCES_LOCAL = 1 << 0,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2) that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_PLACEHOLDER /* empty enums are invalid */ ,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: Note that the FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+
+ struct
+ {
+ int fd;
+ bool datasync;
+ } fsync;
+
+ struct
+ {
+ int fd;
+ uint32 nbytes;
+ uint64 offset;
+ } flush_range;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN,
+ ARS_OK,
+ ARS_PARTIAL,
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ PgAioHandleSharedCallbackID id:8;
+ PgAioResultStatus status:2;
+ uint32 error_data:22;
+ int32 result;
+} PgAioResult;
+
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the code at the lowest level of initiating
+ * an IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
@@ -36,6 +342,8 @@ typedef enum IoMethod
/* GUCs */
extern const struct config_enum_entry io_method_options[];
extern int io_method;
+extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..82bce1cf27c
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,274 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *    Internal declarations for the asynchronous I/O subsystem
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subjects prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /*
+ * List of bounce_buffers owned by the IO. It would suffice to use an
+ * index-based linked list here.
+ */
+ slist_head bounce_buffers;
+
+ /**
+ * In which list the handle is registered, depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - not in any list
+ * - REAPED - in per-reap context list
+ * - COMPLETED_SHARED - not in any list
+ * - COMPLETED_LOCAL - not in any list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-IO
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we always can acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ /*
+ * Buffers used to perform AIO on data that cannot be operated on directly
+ * in shared memory (either because it is not located there, or because we
+ * need to operate on a copy, as is e.g. the case for writes when checksums
+ * are in use).
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ void (*postmaster_init) (void);
+ void (*postmaster_child_init_local) (void);
+ void (*postmaster_child_init) (void);
+
+ /* teardown */
+ void (*postmaster_before_child_exit) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution)(PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+
+ /* properties */
+ bool can_scatter_gather_direct;
+ bool can_scatter_gather_buffered;
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_sync_ops;
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *    Definition of PgAioHandleRef, which sometimes needs to be usable in
+ *    headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,11 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 87700c7c5c7..1fccaa3eb79 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -52,6 +52,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2462,6 +2463,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2976,6 +2979,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5350,6 +5357,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..b253278f3c1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,9 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ aio_io.o \
+ aio_subject.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index d831c772960..b5370330620 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -14,7 +14,23 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -24,9 +40,968 @@ const struct config_enum_entry io_method_options[] = {
};
int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
+
+
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * AFIXME: rewrite
+ *
+ * Shared completion callbacks can be executed by any backend (otherwise there
+ * would be deadlocks). Therefore they cannot update state for the issuer of
+ * the IO. That can be done with issuer callbacks.
+ *
+ * Note that issuer callbacks are effectively executed in a critical
+ * section. This is necessary as we need to be able to execute IO in critical
+ * sections (consider e.g. WAL logging) and to be able to execute IOs we need
+ * to acquire an IO, which in turn requires executing issuer callbacks. An
+ * alternative scheme could be to defer local callback execution until a later
+ * point, but that gets complicated quickly.
+ *
+ * Therefore the typical pattern is to use an issuer callback to set some
+ * flags in backend local memory, which can then be used to error out at a
+ * later time.
+ *
+ * NB: The issuer callback is cleared when the resowner owning the IO goes out
+ * of scope.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (my_aio->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(my_aio->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ioh->state = AHS_HANDED_OUT;
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ ioh->report_return = ret;
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected state: idle handle should not be owned by a resowner");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, as the memory
+ * it's referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT && state != AHS_REAPED && state != AHS_COMPLETED_SHARED && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ if (pgaio_impl->wait_one)
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_REAPED && state != AHS_DEFINED &&
+ state != AHS_IN_FLIGHT)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh <= (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "IDLE";
+ case AHS_HANDED_OUT:
+ return "HANDED_OUT";
+ case AHS_DEFINED:
+ return "DEFINED";
+ case AHS_PREPARED:
+ return "PREPARED";
+ case AHS_IN_FLIGHT:
+ return "IN_FLIGHT";
+ case AHS_REAPED:
+ return "REAPED";
+ case AHS_COMPLETED_SHARED:
+ return "COMPLETED_SHARED";
+ case AHS_COMPLETED_LOCAL:
+ return "COMPLETED_LOCAL";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->state = AHS_DEFINED;
+ ioh->result = 0;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_prepare_subject(ioh);
+
+ ioh->state = AHS_PREPARED;
+
+ elog(DEBUG3, "io:%d: prepared %s",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh));
+
+ if (!pgaio_io_needs_synchronous_execution(ioh))
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == AHS_IN_FLIGHT);
+
+ ioh->result = result;
+
+ pg_write_barrier();
+
+ /* FIXME: should be done in separate function */
+ ioh->state = AHS_REAPED;
+
+ pgaio_io_process_completion_subject(ioh);
+
+ /* ensure results of completion are visible before the new state */
+ pg_write_barrier();
+
+ ioh->state = AHS_COMPLETED_SHARED;
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (pgaio_impl->needs_synchronous_execution)
+ return pgaio_impl->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Mark the IO as being handed off to the IO method for processing.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ ioh->state = AHS_IN_FLIGHT;
+ pg_write_barrier();
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ if (ioh->report_return)
+ {
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ pg_write_barrier();
+ ioh->generation++;
+ pg_write_barrier();
+ ioh->state = AHS_IDLE;
+ pg_write_barrier();
+
+ dclist_push_tail(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ bool found_handed_out = false;
+ int reclaimed = 0;
+ static uint32 lastpos = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ my_aio->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - when using
+ * worker mode, that'll often be the case. We could do so as part of the
+ * loop below, but that'd potentially lead us to wait for an IO that was
+ * submitted earlier.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+
+ /*
+ * While one might think that pgaio_io_get_nb() should have
+ * succeeded, this is reachable because the IO could have
+ * completed during the submission above.
+ */
+ return;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_HANDED_OUT:
+ if (found_handed_out)
+ elog(ERROR, "more than one handed out IO");
+ found_handed_out = true;
+ continue;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ lastpos = i;
+ return;
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+ lastpos = i;
+ return;
+ }
+ }
+
+ elog(PANIC, "could not reclaim any handles");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME: It probably is not correct to have bounce buffers be
+ * per-backend; they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (my_aio->num_staged_ios == 0)
+ return;
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit(my_aio->num_staged_ios, my_aio->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ my_aio->num_staged_ios = 0;
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return my_aio->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Need to submit staged but not yet submitted IOs using the fd, otherwise
+ * the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
+}
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 1c277a7eb3b..e25bdf1dba0 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,33 +14,337 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * currently are used only for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * If io_max_concurrency is -1, we automatically choose a suitable value.
+ *
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ slist_init(&bs->idle_bbs);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_impl->shmem_init)
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_postmaster_init(void)
{
+ if (pgaio_impl->postmaster_init)
+ pgaio_impl->postmaster_init();
}
void
pgaio_postmaster_child_init(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_impl->postmaster_child_init)
+ pgaio_impl->postmaster_child_init();
}
void
pgaio_postmaster_child_init_local(void)
{
+ if (pgaio_impl->postmaster_child_init_local)
+ pgaio_impl->postmaster_child_init_local();
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..5b2f9ee3ba6
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,111 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * Asynchronous I/O subsystem.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READ:
+ return "read";
+ case PGAIO_OP_WRITE:
+ return "write";
+ case PGAIO_OP_FSYNC:
+ return "fsync";
+ case PGAIO_OP_FLUSH_RANGE:
+ return "flush_range";
+ case PGAIO_OP_NOP:
+ return "nop";
+ }
+
+ pg_unreachable();
+}
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READ);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITE);
+}
+
+
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITE:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ default:
+ elog(ERROR, "IO operation %d not yet supported", ioh->op);
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..51ee3b3969d
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,167 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * IO completion handling for IOs on different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+static const PgAioHandleSharedCallbacks *aio_shared_cbs[] = {
+};
+
+
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ if (aio_shared_cbs[cbid]->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cbid num %d, id %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1, cbid);
+
+ ioh->num_shared_callbacks++;
+}
+
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacks *cbs = aio_shared_cbs[cbid];
+
+ if (!cbs->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d: prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid);
+ cbs->prepare(ioh);
+ }
+}
+
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = 0; /* FIXME */
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid;
+
+ cbid = ioh->shared_callbacks[i - 1];
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cbid num %d, id %d with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i, cbid,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = aio_shared_cbs[cbid]->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return aio_subject_info[ioh->subject]->reopen != NULL;
+}
+
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ const PgAioHandleSharedCallbacks *scb;
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ scb = aio_shared_cbs[result.id];
+
+ if (scb->error == NULL)
+ elog(ERROR, "scb id %d does not have error callback", result.id);
+
+ scb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..8339d473aae 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,8 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_subject.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..9a3e70bde33
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * "AIO" implementation that just executes IO synchronously
+ *
+ * This method exists mainly to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..99ec8321746 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -191,6 +191,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a4b3c7c62bd..e5886f3b0e9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3201,6 +3201,31 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO Bounce Buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3a5e307c9dc..ed746b8a533 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -841,6 +841,12 @@
#io_method = sync # (change requires restart)
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,13 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -425,6 +434,9 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
+
return owner;
}
@@ -725,6 +737,21 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1109,27 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2681dd51bb7..2f463d29ca1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1259,6 +1259,7 @@ IntoClause
InvalMessageArray
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2094,6 +2095,24 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleState
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
v2.1-0008-aio-Skeleton-IO-worker-infrastructure.patch
From 00976ef4bb067dda2454e0f4c4a74fc421715954 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:24:51 -0400
Subject: [PATCH v2.1 08/20] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/postmaster.c | 186 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 84 ++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
16 files changed, 311 insertions(+), 15 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb9..d043445b544 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -352,6 +352,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -380,6 +381,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 63c12917cfe..4cc000df79e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -62,6 +62,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 5bcfb8a9d58..a38dd982fbe 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -23,4 +23,6 @@ extern void pgaio_postmaster_init(void);
extern void pgaio_postmaster_child_init_local(void);
extern void pgaio_postmaster_child_init(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index deeb06c9e01..b466ba843d6 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -442,7 +442,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 0ae23fdf55e..78429b2af2f 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -55,6 +55,7 @@
#include "replication/walreceiver.h"
#include "storage/dsm.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
@@ -199,6 +200,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 70c5ce19f6e..3d970374733 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "replication/walsender.h"
#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
@@ -321,6 +322,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead_end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -382,6 +384,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static pid_t io_worker_pids[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -420,6 +426,9 @@ static int CountChildren(int target);
static Backend *assign_backendlist_entry(void);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
+static void signal_io_workers(int signal);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(BackendType type);
static void StartAutovacuumWorker(void);
@@ -1334,6 +1343,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPID == 0)
CheckpointerPID = StartChildProcess(B_CHECKPOINTER);
@@ -1346,7 +1360,6 @@ PostmasterMain(int argc, char *argv[])
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -1995,6 +2008,7 @@ process_pm_reload_request(void)
signal_child(SysLoggerPID, SIGHUP);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, SIGHUP);
+ signal_io_workers(SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -2527,6 +2541,22 @@ process_pm_child_exit(void)
}
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+
+ if (io_worker_count == 0 &&
+ pmState >= PM_SHUTDOWN_IO)
+ {
+ pmState = PM_WAIT_DEAD_END;
+ }
+ continue;
+ }
+
/*
* We don't know anything about this child process. That's highly
* unexpected, as we do track all the child processes that we fork.
@@ -2764,6 +2794,9 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
if (SlotSyncWorkerPID != 0)
sigquit_child(SlotSyncWorkerPID);
+ /* Take care of io workers too */
+ signal_io_workers(SIGQUIT);
+
/* We do NOT restart the syslogger */
}
@@ -2987,10 +3020,11 @@ PostmasterStateMachine(void)
FatalError = true;
pmState = PM_WAIT_DEAD_END;
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and aio workers too */
SignalChildren(SIGQUIT);
if (PgArchPID != 0)
signal_child(PgArchPID, SIGQUIT);
+ signal_io_workers(SIGQUIT);
}
}
}
@@ -3000,16 +3034,26 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead_end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead_end children and aio workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0)
{
- pmState = PM_WAIT_DEAD_END;
+ pmState = PM_SHUTDOWN_IO;
+ signal_io_workers(SIGUSR2);
}
}
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+ * PM_SHUTDOWN_IO state ends when only dead_end children are left.
+ */
+ if (io_worker_count == 0)
+ pmState = PM_WAIT_DEAD_END;
+ }
+
if (pmState == PM_WAIT_DEAD_END)
{
/* Don't allow any new socket connection events. */
@@ -3017,17 +3061,22 @@ PostmasterStateMachine(void)
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
- * (ie, no dead_end children remain), and the archiver is gone too.
+ * (ie, no dead_end children remain), and the archiver and aio workers
+ * are all gone too.
*
- * The reason we wait for those two is to protect them against a new
+ * We need to wait for those because we might have transitioned
+ * directly to PM_WAIT_DEAD_END due to immediate shutdown or fatal
+ * error. Note that they have already been sent appropriate shutdown
+ * signals, either during a normal state transition leading up to
+ * PM_WAIT_DEAD_END, or during FatalError processing.
+ *
+ * The reason we wait for those is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
- * shutdown-and-immediately-restart scenario. Note that they have
- * already been sent appropriate shutdown signals, either during a
- * normal state transition leading up to PM_WAIT_DEAD_END, or during
- * FatalError processing.
+ * shutdown-and-immediately-restart scenario.
*/
- if (dlist_is_empty(&BackendList) && PgArchPID == 0)
+ if (dlist_is_empty(&BackendList) && io_worker_count == 0
+ && PgArchPID == 0)
{
/* These other guys should be dead already */
Assert(StartupPID == 0);
@@ -3120,10 +3169,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPID = StartChildProcess(B_STARTUP);
Assert(StartupPID != 0);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3375,6 +3428,7 @@ TerminateChildren(int signal)
signal_child(PgArchPID, signal);
if (SlotSyncWorkerPID != 0)
signal_child(SlotSyncWorkerPID, signal);
+ signal_io_workers(signal);
}
/*
@@ -3956,6 +4010,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4149,6 +4204,109 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == pid)
+ {
+ --io_worker_count;
+ io_worker_pids[id] = 0;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ /* ATODO: This will need to check if io_method == worker */
+
+ /*
+ * If we're in the final shutdown state, we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ int pid;
+ int id;
+
+ /* Find the lowest unused IO worker ID. */
+
+ /*
+ * AFIXME: This logic doesn't work right now, the ids aren't
+ * transported to workers anymore.
+ */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_pids[id] == 0)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Try to launch one. */
+ pid = StartChildProcess(B_IO_WORKER);
+ if (pid > 0)
+ {
+ io_worker_pids[id] = pid;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* Ask the highest used IO worker ID to exit. */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_pids[id] != 0)
+ {
+ kill(io_worker_pids[id], SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+static void
+signal_io_workers(int signal)
+{
+ for (int i = 0; i < MAX_IO_WORKERS; ++i)
+ if (io_worker_pids[i] != 0)
+ signal_child(io_worker_pids[i], signal);
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index b253278f3c1..fa2a7e9e5df 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_io.o \
aio_subject.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8339d473aae..62738ce1d14 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,5 +6,6 @@ backend_sources += files(
'aio_io.c',
'aio_subject.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..5df2eea4a03
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 4dc46b17b41..d42546db195 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3294,6 +3294,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8af55989eed..a750caa9b2a 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -335,6 +335,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
{
case B_INVALID:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 99ec8321746..ecc513aa7bd 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b8fa2e64ffe..bedeed588d3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = "checkpointer";
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = "logger";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e5886f3b0e9..40737882fb4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3226,6 +3227,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ed746b8a533..8c062240373 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -840,6 +840,7 @@
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0009-aio-Add-worker-method.patch (text/x-diff)
From bc2016ad468094ccc09507d3ddd755f5c7692d4b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 15:27:00 -0400
Subject: [PATCH v2.1 09/20] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/postmaster/postmaster.c | 3 +-
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 15 +
src/backend/storage/aio/method_worker.c | 404 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
10 files changed, 428 insertions(+), 9 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index c0a59f47bc0..1e4c8807c71 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -332,11 +332,12 @@ extern void assign_io_method(int newval, void *extra);
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to the worker method. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/* GUCs */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 82bce1cf27c..b6f44a875dd 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -264,6 +264,7 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
+extern const IoMethodOps pgaio_worker_ops;
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd6..7aaccf69d1e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, AioWorkerSubmissionQueue)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3d970374733..76440321d18 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4222,7 +4222,8 @@ maybe_reap_io_worker(int pid)
static void
maybe_adjust_io_workers(void)
{
- /* ATODO: This will need to check if io_method == worker */
+ if (!pgaio_workers_enabled())
+ return;
/*
* If we're in the final shutdown state, we're just waiting for all
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index b5370330620..0ca641d9322 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -36,6 +36,7 @@ static void pgaio_bounce_buffer_wait_for_free(void);
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -53,6 +54,7 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index e25bdf1dba0..ca3513019a6 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -19,6 +19,7 @@
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -37,6 +38,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee that nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -333,6 +339,9 @@ pgaio_postmaster_child_init(void)
/* shouldn't be initialized twice */
Assert(!my_aio);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -348,3 +357,9 @@ pgaio_postmaster_child_init_local(void)
if (pgaio_impl->postmaster_child_init_local)
pgaio_impl->postmaster_child_init_local();
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ return io_method == IOMETHOD_WORKER;
+}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 5df2eea4a03..a6c21df2ea5 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -3,6 +3,21 @@
* method_worker.c
* AIO implementation using workers
*
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken worker can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
+ *
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -16,24 +31,290 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/wait_event.h"
+#include "utils/ps_status.h"
+
+
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+static void pgaio_worker_postmaster_child_init_local(void);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+ .postmaster_child_init_local = pgaio_worker_postmaster_child_init_local,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+#if 0
+ .wait_one = pgaio_worker_wait_one,
+ .retry = pgaio_worker_io_retry,
+ .drain = pgaio_worker_drain,
+#endif
+
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+static void
+pgaio_worker_postmaster_child_init_local(void)
+{
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "io worker submission queue full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
/* TODO review all signals */
pqsignal(SIGHUP, SignalHandlerForConfigReload);
@@ -49,7 +330,34 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, procsignal_sigusr1_handler);
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
- sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* FIXME: locking */
+ MyIoWorkerId = -1;
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "could not find a free IO worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ snprintf(cmd, sizeof(cmd), "worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
@@ -64,21 +372,107 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioHandle *, ioh),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
/* We can now handle ereport(ERROR) */
PG_exception_stack = &local_sigjmp_buf;
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(0);
}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ecc513aa7bd..3678f2b3e43 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -351,6 +351,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c062240373..1fc8336496c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -839,7 +839,7 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f463d29ca1..f1cac7aa5bf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0010-aio-Add-liburing-dependency.patch (text/x-diff)
From 8cacec347f18d4d6390648928769cd084f57b77f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.1 10/20] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/pg_config.h.in | 3 +
src/makefiles/meson.build | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
configure.ac | 11 +++
meson.build | 14 ++++
meson_options.txt | 3 +
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 38006367a40..7d2fcb9d0f5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -693,6 +693,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 850e9275845..cca689b2028 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -230,6 +232,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/configure b/configure
index 53c8a1f2bad..aa82bafe783 100755
--- a/configure
+++ b/configure
@@ -654,6 +654,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -712,6 +714,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -865,6 +868,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -907,6 +911,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1574,6 +1580,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1617,6 +1624,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8664,6 +8675,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13209,6 +13254,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/configure.ac b/configure.ac
index 6a35b2880bf..04480cdea0a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -970,6 +970,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1426,6 +1434,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/meson.build b/meson.build
index 4764b09266e..53266e04005 100644
--- a/meson.build
+++ b/meson.build
@@ -848,6 +848,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3094,6 +3106,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3738,6 +3751,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index b9421557606..084eebe72d7 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b49761..a8ff18faed6 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.827.g557ae147e6
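For reference, the build-system patch above exposes the new dependency three ways: `--with-liburing` for autoconf, a `liburing` feature option for meson, and the `LIBURING_CFLAGS` / `LIBURING_LIBS` override variables. A small sketch of how one might enable it, and of the override precedence implemented by the generated configure script (the install prefix below is made up):

```shell
# Enabling the dependency (commands as added by the patch):
#   autoconf: ./configure --with-liburing
#   meson:    meson setup build -Dliburing=enabled
#
# As in the generated configure script above, an explicitly set
# LIBURING_CFLAGS (or LIBURING_LIBS) wins over asking pkg-config:
LIBURING_CFLAGS="-I/opt/liburing/include"   # hypothetical prefix
if [ -n "$LIBURING_CFLAGS" ]; then
    cflags=$LIBURING_CFLAGS
else
    cflags=$(pkg-config --cflags liburing) || exit 1
fi
echo "$cflags"
```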
Attachment: v2.1-0011-aio-Add-io_uring-method.patch (text/x-diff)
From b50761d00455ef1fd0a0c9625624866c60a7333f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:17 -0400
Subject: [PATCH v2.1 11/20] aio: Add io_uring method
---
src/include/storage/aio.h | 1 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 383 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 398 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1e4c8807c71..b8c743548c9 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -333,6 +333,7 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+ IOMETHOD_IO_URING,
} IoMethod;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index b6f44a875dd..5d18d112e2d 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -265,6 +265,9 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index eabf813ce05..72f928b7602 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index fa2a7e9e5df..3bcb8a0b2ed 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 0ca641d9322..8877a33b9f2 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -37,6 +37,9 @@ static void pgaio_bounce_buffer_wait_for_free(void);
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -55,6 +58,9 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 62738ce1d14..537f23d446d 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..0f0eda0ce9b
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,383 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO implementation using io_uring on Linux
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_postmaster_init(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_postmaster_child_init(void);
+static void pgaio_uring_postmaster_child_init_local(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .postmaster_init = pgaio_uring_postmaster_init,
+ .postmaster_child_init = pgaio_uring_postmaster_child_init,
+ .postmaster_child_init_local = pgaio_uring_postmaster_child_init_local,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+#if 0
+ .retry = pgaio_uring_io_retry,
+ .wait_one = pgaio_uring_wait_one,
+ .drain = pgaio_uring_drain,
+#endif
+ .can_scatter_gather_direct = true,
+ .can_scatter_gather_buffered = true
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_postmaster_init(void)
+{
+ uint32 TotalProcs =
+ MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ for (int i = 0; i < TotalProcs; i++)
+ ReserveExternalFD();
+}
+
+static void
+pgaio_uring_postmaster_child_init(void)
+{
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+}
+
+static void
+pgaio_uring_postmaster_child_init_local(void)
+{
+ int ret;
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, num_staged_ios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme; nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d in state %s, cycle %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "unexpected: %d/%s: %m", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READ:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITE:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ default:
+ elog(ERROR, "not implemented");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5fa77412ed..b138a36c461 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f1cac7aa5bf..46d31cf2b9f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2116,6 +2116,7 @@ PgAioReturn
PgAioSubjectData
PgAioSubjectID
PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0016-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (text/x-diff)
From 3b51bfa51eac42157c8177437fb6993ed349c0f3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.1 16/20] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 29 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a075a40b2ed..ac6496bb1eb 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -117,6 +117,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 2)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7f0e07d9586..7ff2d6a2071 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -91,6 +91,7 @@
#include "catalog/pg_tablespace.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -241,14 +242,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -307,6 +312,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -356,6 +369,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -380,6 +394,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -494,10 +510,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -728,7 +745,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4914c71d41e..ed384fa1a44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1638,7 +1638,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.827.g557ae147e6
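The read_stream changes above defer pgaio_submit_staged() so that look-ahead can stage several IOs and flush them with a single submission instead of one syscall per IO. The effect can be modeled in a few lines of plain C; all names here are invented for the illustration:

```c
#include <assert.h>

#define MAX_STAGED 32

static int	staged[MAX_STAGED];
static int	nstaged;
static int	submit_calls;		/* count of (expensive) submit syscalls */
static int	submitted_total;	/* total IOs handed to the kernel */

/* Stand-in for read_stream_start_pending_read() with the
 * "caller will issue more IO, don't submit" flag set. */
static void
stage_io(int blocknum)
{
	assert(nstaged < MAX_STAGED);
	staged[nstaged++] = blocknum;
}

/* Stand-in for pgaio_submit_staged(): one batched submission. */
static void
submit_staged(void)
{
	if (nstaged == 0)
		return;
	submit_calls++;
	submitted_total += nstaged;
	nstaged = 0;
}

/* Stand-in for read_stream_look_ahead(): stage everything, then
 * flush once at the end, as the patch does. */
static void
look_ahead(int nblocks)
{
	for (int blk = 0; blk < nblocks; blk++)
		stage_io(blk);
	submit_staged();
}
```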
Attachment: v2.1-0017-aio-Add-IO-queue-helper.patch (text/x-diff)
From 2d390f78e46219c8bace6d37ff35d20f6ff0fd30 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:42 -0400
Subject: [PATCH v2.1 17/20] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3bcb8a0b2ed..f3a7f9e63d6 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..4dda2f4e20e
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 537f23d446d..e8a88e615c0 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 46d31cf2b9f..a38141b4e50 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1172,6 +1172,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2960,6 +2961,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.827.g557ae147e6
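The IOQueue above bounds how many IOs a caller keeps in flight: reserving a slot when all of them are busy first waits for the oldest IO to complete. A self-contained toy model of that depth-limiting behavior, where instant "completion" stands in for pgaio_io_ref_wait() and every name is invented for the sketch:

```c
#include <assert.h>

#define QUEUE_DEPTH 4			/* io_queue_create(depth, ...) analogue */

static int	in_progress[QUEUE_DEPTH];
static int	nin_progress;
static int	completed;

/* Analogue of io_queue_wait_one(): retire the oldest in-flight IO. */
static void
queue_wait_one(void)
{
	assert(nin_progress > 0);
	for (int i = 1; i < nin_progress; i++)	/* shift queue head off */
		in_progress[i - 1] = in_progress[i];
	nin_progress--;
	completed++;
}

/* Analogue of io_queue_reserve() + io_queue_track(): a full queue
 * forces a wait before another IO may be tracked. */
static void
queue_track(int io_id)
{
	if (nin_progress == QUEUE_DEPTH)
		queue_wait_one();
	in_progress[nin_progress++] = io_id;
}

/* Analogue of io_queue_wait_all(): drain everything in flight. */
static void
queue_wait_all(void)
{
	while (nin_progress > 0)
		queue_wait_one();
}
```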
Attachment: v2.1-0018-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff)
From 52aab8396a446a90e23178fd0c593fddfa433a7a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2.1 18/20] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro rather than on the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 1 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 588 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 64 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5cfa7dbd1f1..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,7 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ac6496bb1eb..a65888c8915 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -325,7 +325,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..6f8fe796da3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy the
+ * remaining body of HandleMainLoopInterrupts() here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index eeb73c85726..17aa980aa80 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -708,7 +711,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -741,6 +744,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ed384fa1a44..6ec700e5ef2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -2954,6 +2955,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -2985,7 +3036,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3047,7 +3101,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3155,48 +3211,89 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since SyncOneBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * SyncOneBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, SyncOneBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ break;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3214,15 +3311,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3248,7 +3353,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3291,6 +3396,8 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3467,11 +3574,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3482,6 +3603,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == io_combine_limit)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3493,6 +3621,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3531,8 +3664,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3541,22 +3732,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
- int result = 0;
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
uint32 buf_state;
- BufferTag tag;
+ int result = 0;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3566,7 +3791,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3575,40 +3800,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
- /*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
- */
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
-
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
-
- tag = bufHdr->tag;
-
- UnpinBuffer(bufHdr);
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
- return result | BUF_WRITTEN;
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block, nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
+
+ /*
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
+ */
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
+
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -3974,6 +4441,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d29..8295e3fb0a0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1491,6 +1491,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A checksum copy is needed only when data checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a38141b4e50..9973162dc86 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -345,6 +345,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0019-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From b3d46d7af01fb746ab8a366a771420b4608a337e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2.1 19/20] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 180 +++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 ++++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 51 ++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 94 ++++
src/test/modules/test_aio/test_aio.c | 479 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1290 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 5d18d112e2d..a44cdb457ee 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -262,6 +262,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 8877a33b9f2..7efc9631f5f 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -22,6 +22,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -67,6 +70,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -543,6 +551,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
/* FIXME: should be done in separate function */
ioh->state = AHS_REAPED;
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
/* ensure results of completion are visible before the new state */
@@ -1013,3 +1034,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6ec700e5ef2..44b1b6fb316 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6095,7 +6094,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520a..7df90602e90 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d236..bc7d19e694f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: under meson these tests run once per supported io_method (sync,
+# worker and, if available, io_uring); the make build only runs them once.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e52b0f086dd
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,180 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+NOTICE: wrapped error: could not read blocks 1..2 in file base/<redacted>: read only 8192 of 16384 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..b3d34de8977
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,51 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+-- shorten multi-block read to a single block, should retry, but that's not
+-- implemented yet
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b;
+$$);
+SELECT inj_io_short_read_detach();
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- FIXME: Should error
+-- FIXME: errno encoding?
+SELECT inj_io_short_read_attach(-5);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce Buffers handles
+----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..ea9ad43ed8f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,94 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..9626d495241
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,479 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for the AIO subsystem.
+ *
+ * Provides SQL-callable wrappers around low-level AIO operations (handle
+ * and bounce buffer acquisition/release, reading corrupted or invalidated
+ * relation blocks) so that regression tests can exercise resource
+ * ownership, error recovery and IO completion paths. When injection
+ * points are available, it additionally allows short reads and IO errors
+ * to be injected into IO completion processing.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/lwlock.h"
+#include "storage/ipc.h"
+#include "access/relation.h"
+#include "utils/rel.h"
+#include "utils/injection_point.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize the shared state and attach the
+ * injection point callback.
+ */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if just a test, we should verify nobody else uses this */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.827.g557ae147e6
Attachment: v2.1-0020-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 32ae60ce61cefe5c2e30341049e0c08b15e36de6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.1 20/20] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there's just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..5be8125ad3a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,11 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.827.g557ae147e6
Hi,
On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
Yep, that was already on my todo list. The version I just posted includes
that.
For example, which paths were modified in the AIO module?
Is it the path for writing WAL logs, or the path for flushing pages, etc.?
I don't think it's good to document this in a design document - that's just
bound to get out of date.
For now the patchset causes AIO to be used for
1) all users of read_stream.h, e.g. sequential scans
2) bgwriter / checkpointer, mainly to have a way to exercise the write path. As
mentioned in my email upthread, the code for that is in a somewhat rough
shape as Thomas Munro is working on a more general abstraction for some of
this.
The earlier patchset added a lot more AIO uses because I needed to know all
the design constraints. It e.g. added AIO use in WAL. While that allowed me to
learn a lot, it's not something that makes sense to continue working on for
now, as it requires a lot of work that's independent of AIO. Thus I am
focusing on the above users for now.
Also, I recommend keeping this patch as small as possible.
Yep. That's my goal (as mentioned upthread).
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.
Currently the patchset doesn't contain libaio support and I am not planning to
work on using libaio. Nor do I think it makes sense for anybody else to do so
- libaio doesn't work for buffered IO, making it imo not particularly useful
for us.
The io_uring specific code isn't particularly complex / large compared to the
main AIO infrastructure.
Greetings,
Andres Freund
Hi Andres
Thanks for the AIO patch update. I gave it a try and ran into a FATAL
in bgwriter when executing a benchmark.
2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes
I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.
Best regards
Robert
Attachments:
Attachment: 0001-Fix-BgBufferSync-to-limit-batch-size-under-io_bounce.patch (text/x-patch)
From bd04bd18ce62cf3f88568d3578503d4efeeb6603 Mon Sep 17 00:00:00 2001
From: Robert Pang <robertpang@google.com>
Date: Thu, 12 Sep 2024 14:36:16 -0700
Subject: [PATCH] Fix BgBufferSync to limit batch size under io_bounce_buffers
for bgwriter.
---
src/backend/storage/buffer/bufmgr.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44b1b6fb31..4cd959b295 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3396,6 +3396,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
uint32 new_recent_alloc;
BuffersToWrite to_write;
+ int max_combine;
/*
* Find out where the freelist clock sweep currently is, and how many
@@ -3417,6 +3418,8 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3604,7 +3607,7 @@ BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
Assert(sync_state & BUF_REUSABLE);
- if (to_write.nbuffers == io_combine_limit)
+ if (to_write.nbuffers == max_combine)
{
WriteBuffers(&to_write, ioq, wb_context);
}
--
2.46.0.662.g92d0881bb0-goog
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
There's plenty more to do, but I thought this would be a useful checkpoint.
I find patches 1-5 are Ready for Committer.
+typedef enum PgAioHandleState
This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md. Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?
+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,
Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)
+
+	/* IO finished, but result has not yet been processed */
+	AHS_REAPED,
+
+	/* IO completed, shared completion has been called */
+	AHS_COMPLETED_SHARED,
+
+	/* IO completed, local completion has been called */
+	AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
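To make the requested transition documentation concrete, here is one plausible reading of the state machine as a self-contained sketch. The enum values are copied from the quoted patch, but the validity table is this reader's guess at the intended transitions, not something the patch itself states:

```c
#include <stdbool.h>

/* states copied from the quoted enum; numbering illustrative */
typedef enum PgAioHandleState
{
	AHS_IDLE = 0,
	AHS_HANDED_OUT,
	AHS_DEFINED,
	AHS_PREPARED,
	AHS_IN_FLIGHT,
	AHS_REAPED,
	AHS_COMPLETED_SHARED,
	AHS_COMPLETED_LOCAL
} PgAioHandleState;

/*
 * Guessed validity table: the happy path advances one state at a time,
 * and any non-idle handle can be reclaimed back to AHS_IDLE.
 */
static bool
aio_state_transition_valid(PgAioHandleState from, PgAioHandleState to)
{
	/* pgaio_io_reclaim(): release from any live state */
	if (to == AHS_IDLE)
		return from != AHS_IDLE;

	switch (from)
	{
		case AHS_IDLE:
			return to == AHS_HANDED_OUT;	/* pgaio_io_get() */
		case AHS_HANDED_OUT:
			return to == AHS_DEFINED;	/* pgaio_io_prepare() */
		case AHS_DEFINED:
			return to == AHS_PREPARED;	/* subject prepare() callback */
		case AHS_PREPARED:
			return to == AHS_IN_FLIGHT;	/* submission */
		case AHS_IN_FLIGHT:
			return to == AHS_REAPED;	/* result obtained */
		case AHS_REAPED:
			return to == AHS_COMPLETED_SHARED;	/* shared completion ran */
		case AHS_COMPLETED_SHARED:
			return to == AHS_COMPLETED_LOCAL;	/* local completion ran */
		case AHS_COMPLETED_LOCAL:
			return false;
	}
	return false;
}
```

A table like this in README.md, annotated with which transitions only the defining backend may perform, would answer the question above.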
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+	PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+	Assert(ioh->resowner);
+
+	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+	ioh->resowner = NULL;
+
+	switch (ioh->state)
+	{
+		case AHS_IDLE:
+			elog(ERROR, "unexpected");
+			break;
+		case AHS_HANDED_OUT:
+			Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+			if (ioh == my_aio->handed_out_io)
+			{
+				my_aio->handed_out_io = NULL;
+				if (!on_error)
+					elog(WARNING, "leaked AIO handle");
+			}
+
+			pgaio_io_reclaim(ioh);
+			break;
+		case AHS_DEFINED:
+		case AHS_PREPARED:
+			/* XXX: Should we warn about this when is_commit? */
Yes.
+			pgaio_submit_staged();
+			break;
+		case AHS_IN_FLIGHT:
+		case AHS_REAPED:
+		case AHS_COMPLETED_SHARED:
+			/* this is expected to happen */
+			break;
+		case AHS_COMPLETED_LOCAL:
+			/* XXX: unclear if this ought to be possible? */
+			pgaio_io_reclaim(ioh);
+			break;
+	}
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();
Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup. Even so,
the next point might remove the need here:
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;
As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi
I used the attached makefile patch to build w/ liburing.
+pgaio_uring_shmem_init(bool first_time)
+{
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+	bool		found;
+
+	aio_uring_contexts = (PgAioUringContext *)
+		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+	if (found)
+		return;
+
+	for (int contextno = 0; contextno < TotalProcs; contextno++)
+	{
+		PgAioUringContext *context = &aio_uring_contexts[contextno];
+		int			ret;
+
+		/*
+		 * XXX: Probably worth sharing the WQ between the different rings,
+		 * when supported by the kernel. Could also cause additional
+		 * contention, I guess?
+		 */
+#if 0
+		if (!AcquireExternalFD())
+			elog(ERROR, "No external FD available");
+#endif
+		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request
I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+	struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+	for (int i = 0; i < num_staged_ios; i++)
+	{
+		PgAioHandle *ioh = staged_ios[i];
+		struct io_uring_sqe *sqe;
+
+		sqe = io_uring_get_sqe(uring_instance);
+
+		pgaio_io_prepare_submit(ioh);
+		pgaio_uring_sq_from_io(ioh, sqe);
+	}
+
+	while (true)
+	{
+		int			ret;
+
+		pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+		ret = io_uring_submit(uring_instance);
+		pgstat_report_wait_end();
+
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}
Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:
EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
+	int			returnCode;
+	Vfd		   *vfdP;
+
+	Assert(FileIsValid(file));
+
+	DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+			   file, VfdCache[file].fileName,
+			   (int64) offset,
+			   iovcnt));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	vfdP = &VfdCache[file];
+
+	/* FIXME: think about / reimplement temp_file_limit */
+
+	pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+	return 0;
+}
FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them. Otherwise,
deadlocks like this would happen:
backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons
If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically. Could you add a mention of "deadlock" in the
comment at whichever code achieves that?
I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations. Is there any
specific decision you'd like to settle before patch 6 exits WIP?
Thanks,
nm
Attachments:
Attachment: uring-makefile-v1.patch (text/plain)
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 84302cc..b123fdc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
$(top_builddir)/src/common/libpgcommon_srv.a \
$(top_builddir)/src/port/libpgport_srv.a
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
# The backend doesn't need everything that's in LIBS, however
LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
Hi,
Thanks for the review!
On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
There's plenty more to do, but I thought this would be a useful checkpoint.
I find patches 1-5 are Ready for Committer.
Cool!
+typedef enum PgAioHandleState
This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md.
Makes sense.
Would you also cover the valid state transitions and which of them any
backend can do vs. which are specific to the defining backend?
Yea, we should. I earlier had something, but because details were still
changing, it was hard to keep up to date.
+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)
There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.
That, or not having submitted the IO. One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
Makes sense.
+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.
Thanks, will incorporate.
With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:
Right - that's to be expected.
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.
I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful for portability testing.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
...
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.
Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.
What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.
For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them.
Yep - it's something I've been fighting with / redesigning a *lot*. Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:
1) What if the IO is being completed in a critical section? We can't reopen
the file in that situation. My initial fix for this was to defer retries,
but that's problematic too:
2) Acquiring an IO needs to be able to guarantee forward progress. Because
there's a limited number of IOs that means we need to be able to complete
IOs while acquiring an IO. So we can't just keep the IO handle around -
which in turn means that we'd need to save the state for retrying
somewhere. Which would require some pre-allocated memory to save that
state.
Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...
Otherwise, deadlocks like this would happen:
backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasonsIf that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.
Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.
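For what it's worth, the shape of that idea (complete any pending IO on the buffer before sleeping on its lock, so the IO-vs-lock deadlock above can't arise) can be shown with stand-in types; everything here except the pgaio_io_ref_wait() concept itself is invented for the sketch:

```c
#include <stdbool.h>

/* stand-in for a buffer with a content lock and possibly in-flight AIO */
typedef struct StubBuffer
{
	bool		locked;			/* held by some other backend */
	bool		io_in_flight;	/* unfinished AIO on this buffer */
} StubBuffer;

/* stand-in for pgaio_io_ref_wait(): drives the IO to completion */
static void
stub_io_ref_wait(StubBuffer *buf)
{
	buf->io_in_flight = false;
}

/*
 * Try to lock the buffer; on contention, first complete any in-flight
 * IO so we never sleep on the lock while the holder waits on our IO.
 * Returns false where the real code would block on the lwlock.
 */
static bool
stub_lock_buffer(StubBuffer *buf)
{
	if (buf->locked)
	{
		if (buf->io_in_flight)
			stub_io_ref_wait(buf);
		return false;			/* would now sleep on the lock */
	}
	buf->locked = true;
	return true;
}
```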
I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.
Agreed.
Is there any specific decision you'd like to settle before patch 6 exits
WIP?
Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?
In case you mean 6+7 or 6 to ~11, I can think of the following:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connection. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
- The header split doesn't quite seem right yet
- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications
- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.
- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.
- I'm not sure that I like name "subject" for the different things AIO is
performed for
- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.
- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.
- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.
- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.
That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.
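As a data point for the RLIMIT_NOFILE concern: raising the soft limit is cheap and needs no privileges as long as it stays at or below the hard limit, so startup could do something along these lines (the function name and error handling are invented here, and a real patch would derive the needed count from MaxBackends etc.):

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE to "needed" if possible, capping at the
 * hard limit. Returns 0 on success (or if already sufficient), -1 on
 * failure. Purely illustrative; not from the patchset.
 */
static int
raise_nofile_soft_limit(rlim_t needed)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;

	if (rl.rlim_cur >= needed)
		return 0;				/* already sufficient */

	/* can't exceed the hard limit without privileges */
	rl.rlim_cur = (needed < rl.rlim_max) ? needed : rl.rlim_max;

	return setrlimit(RLIMIT_NOFILE, &rl);
}
```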
Thanks again,
Andres Freund
Hi,
On 2024-09-12 14:55:49 -0700, Robert Pang wrote:
Hi Andres
Thanks for the AIO patch update. I gave it a try and ran into a FATAL
in bgwriter when executing a benchmark.2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processesI debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.
Good catch, thanks!
I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...
Greetings,
Andres Freund
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner
Works for me.
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
...
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have an essentially unbounded number of requests outstanding.
What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?
I'd try the following. First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns. I don't specifically know of a way it's unsafe. Do just one
pass of that; there may be newer IOs in progress afterward. If submit still
gets EAGAIN, sleep a bit and retry. Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations. This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...
FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.
Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO. Still, I think this shouldn't use the term "Start" while no state
name uses that term. What else could remove that mismatch?
Is there any specific decision you'd like to settle before patch 6 exits
WIP?
Patch 6 specifically? That I really mainly kept separate for review - it
No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?
No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.
In case you mean 6+7 or 6 to ~11, I can think of the following:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
- The header split doesn't quite seem right yet
I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.
- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications
Yes, that's a blocker to me.
- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.
Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)
- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.
Okay.
- I'm not sure that I like the name "subject" for the different things AIO is
performed for
How about one of these six terms:
- listener, observer [if you view smgr as an observer of IOs in the sense of https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO, in the object-oriented sense]
- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.
Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.
- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.
Agreed. Among the post-patch check-world coverage, which uncovered parts have
the most risk?
- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.
Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker. Of these ones I'm calling non-blockers, which would you most
regret deferring?
- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.
Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.
That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.
How far do you see the design impact spreading on that one?
Thanks,
nm
Hi,
On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.
Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.
Of course this could be addressed by tuning, but it seems like something that
shouldn't need to be tuned by the majority of folks running postgres.
We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com
... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.
Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksum, I don't feel like it's a
particularly promising approach.
However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.
Does that seem reasonable?
Greetings,
Andres Freund
On Mon, 30 Sept 2024 at 16:49, Andres Freund <andres@anarazel.de> wrote:
On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.
AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.
On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.
Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.
FYI, I think you're off by a factor 8, i.e. that would be 2GB/sec and
666MB/sec respectively, given a normal page size of 8kB and exactly
1ms/3ms full round trip latency:
1 page/1 ms * 8kB/page * 256 concurrency = 256 pages/ms * 8kB/page =
2MiB/ms ~= 2GiB/sec.
for 3ms divide by 3 -> ~666MiB/sec.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:
We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com
... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.
Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksum, I don't feel like it's a
particularly promising approach.
However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.
Does that seem reasonable?
Yes.
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
Attached is the next version of the patchset. (..)
Hi Andres,
Thank you for your admirable persistence on this. Please do not take it as
criticism, just a set of questions regarding the patchset v2.1 that
I finally got a little time to play with:
0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?
Otherwise we'll never know that SQ is full in theory, perhaps at least such
a check should be made with Assert() ? (I understand right now that we
allow just up to io_uring_queue_init(io_max_concurrency), but what happens
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?
b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?
1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there are outliers with 10k in the wild too) - so they won't even start - and I
have one doubt on the user-friendliness impact of this. I'm quite certain
it's going to be the same as with pgbouncer where one is forced to tweak
OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
to preallocate and then close() a lot of FDs, so that's safer in runtime.
IMVHO even if we just consume e.g. say > 30% of FDs just for io_uring, the
max_files_per_process loses its spirit a little bit and PG is going to
start losing efficiency too due to frequent open()/close() calls as the fd cache
is too small. Tomas also complained about it some time ago in [1])
So maybe it would be good to introduce couple of sanity checks too (even
after setting higher limit):
- issue FATAL in case of using io_method = io_ring && max_connections would
be close to getrusage(RLIMIT_NOFILE)
- issue warning in case of using io_method = io_ring && we wouldn't have
even real 1k FDs free for handling relation FDs (detect something bad like:
getrusage(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?
3. I find having two such similarly named GUCs
(effective_io_concurrency, io_max_concurrency) confusing. It is clear from the IO_URING
perspective what is io_max_concurrency all about, but I bet having also
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe that README.md could elaborate a little
bit on the relation between those two? Or maybe do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in future? To
add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too while the
earlier mentioned AioChooseMaxConccurrency() goes up to just 64
4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?
5. It appears that pg_stat_io.reads is not refreshed until the
query finishes. While running a query for minutes with this
patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work? Also pg_stat_io.read_time would be
something vague with io_uring/worker, so maybe zero is good here (?).
Otherwise we would have to measure time spent on waiting alone, but that
would force more instructions for calculating io times...
6. After playing with some basic measurements - which went fine, I wanted
to go test simple PostGIS even with sequential scans to see any
compatibility issues (AFAIR Thomas Munro on PGConfEU indicated as good
testing point), but before that I've tried to see what's the TOAST
performance alone with AIO+DIO (debug_io_direct=data). One issue I have
found is that DIO seems to be unusable until somebody teaches TOAST to
use readstreams, is that correct? Maybe I'm doing something wrong, but I
haven't seen any TOAST <-> readstreams topic:
-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)
hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.
The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data seqscan: up to 285MB/s
* but TOASTs maxes out at 132MB/s when using io_uring+DIO
Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
on how to inspect real I/O calls requested to review if the code is issuing
sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
perhaps using some bpftrace / xfsslower is the best way to go ?
8. Not sure if that helps, but I've somehow managed to hit the
impossible situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", but I had to push urings really hard into using futexes
and probably I could have made some error in my coding too for that to occur
[3]. As it stands in that patch from my thread, it was not covered: /*
FIXME: fix ret != submitted ?! seems like bug?! */ (but I had hit that
code-path pretty often with a 6.10.x kernel)
9. Please let me know, what's the current up to date line of thinking about
this patchset: is it intended to be committed as v18 ? As a debug feature
or as non-debug feature? (that is which of the IO methods should be
scrutinized the most as it is going to be the new default - sync or worker?)
10. At this point, does it even make sense to give an experimental try to
pwritev2() with RWF_ATOMIC? (that thing is already in the open, but XFS is
going to cover it with 6.12.x apparently, but I could try with some -rcX)
-J.
p.s. I hope I did not ask stupid questions nor missed anything.
[0]: /messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
- "Right now the io_uring mode has each backend's io_uring instance visible to each other process. (..)"
[1]: /messages/by-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
- sentence after: "FWIW there's another bottleneck people may not realize (..)"
[2]: /messages/by-id/x3f32prdpgalmiieyialqtn53j5uvb2e4c47nvnjetkipq3zyk@xk7jy7fnua6w
[3]: /messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com
Hi,
Sorry for losing track of your message for this long, I saw it just now
because I was working on posting a new version.
On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:
On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Thank you for your admirable persistence on this. Please do not take it as
criticism, just a set of questions regarding the patchset v2.1 that
I finally got a little time to play with:
0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?
Yea, it shouldn't ever happen, but it's worth adding a check.
Otherwise we'll never know that SQ is full in theory, perhaps at least such
a check should be made with Assert() ? (I understand right now that we
allow just up to io_uring_queue_init(io_max_concurrency), but what happens
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?
We'd have PANICed at that failure :)
b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?
That'd not make sense - you'd still want to use separate rings, to avoid
contention.
1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there outliers with 10k in the wild too) - so they won't even start - and I
have one doubt on the user-friendliness impact of this. I'm quite certain
it's going to be the same as with pgbouncer where one is forced to tweak
OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
to preallocate and then close() a lot of FDs, so that's safer in runtime.
IMVHO even if we just consume e.g. say > 30% of FDs just for io_uring, the
max_files_per_process loses its spirit a little bit and PG is going to
start losing efficiency too due to frequent open()/close() calls as the fd cache
is too small. Tomas also complained about it some time ago in [1])
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher. The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.
On several of my machines the hard limit is 1073741816.
So maybe it would be good to introduce couple of sanity checks too (even
after setting higher limit):
- issue FATAL in case of using io_method = io_ring && max_connections would
be close to getrusage(RLIMIT_NOFILE)
- issue warning in case of using io_method = io_ring && we wouldn't have
even real 1k FDs free for handling relation FDs (detect something bad like:
getrusage(RLIMIT_NOFILE) <= max_connections + max_files_per_process)
Probably still worth adding something like this, even if we were to do what I
am suggesting above.
2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?
Yea, that's probably not right.
3. I find having two such similarly named GUCs
(effective_io_concurrency, io_max_concurrency) confusing. It is clear from the IO_URING
perspective what is io_max_concurrency all about, but I bet having also
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe that README.md could elaborate a little
bit on the relation between those two? Or maybe do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in future?
io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.
To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too while
the earlier mentioned AioChooseMaxConccurrency() goes up to just 64
Yea, that should probably be disambiguated.
4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?
I used to have a patch like that in the series and it was a pain to
rebase...
I also suspect this is not quite enough to make debug_io_direct
production ready, even if just considering io_direct=data. Without streaming
read use in heap + index VACUUM, RelationCopyStorage() and a few other places
the performance consequences of using direct IO can be, um, surprising.
5. It appears that pg_stat_io.reads is not refreshed until the
query finishes. While running a query for minutes with this
patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0
Or is that how it is supposed to work?
Currently the patch has a FIXME to add some IO statistics (I think I raised
that somewhere in this thread, too). It's not clear to me what IO time ought
to mean. I suspect the least bad answer is what you suggest:
Also pg_stat_io.read_time would be something vague with io_uring/worker, so
maybe zero is good here (?). Otherwise we would have to measure time spent
on waiting alone, but that would force more instructions for calculating io
times...
I.e. we should track the amount of time spent waiting for IOs.
I don't think tracking time in worker or such would make much sense, that'd
often end up with reporting more IO time than a query took.
6. After playing with some basic measurements - which went fine, I wanted
to go test simple PostGIS even with sequential scans to see any
compatibility issues (AFAIR Thomas Munro on PGConfEU indicated as good
testing point), but before that I've tried to see what's the TOAST
performance alone with AIO+DIO (debug_io_direct=data).
It's worth noting that with the last posted version you needed to increase
effective_io_concurrency to something very high to see sensible
performance.
That's due to the way read_stream_begin_impl() limited the number of buffers
pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends
up allowing only a single IO in flight in case of sequential blocks until
effective_io_concurrency is set to 8 or such. I've adjusted that to some
degree now, but I think that might need a bit more sophistication.
One issue I have found is that DIO seems to be unusable until somebody
teaches TOAST to use readstreams, is that correct? Maybe I'm doing something
wrong, but I haven't seen any TOAST <-> readstreams topic:
Hm, I suspect that a read stream won't help a whole lot in many toast
cases. Unless you have particularly long toast datums, the time is going to be
dominated by the random accesses, as each toast datum is looked up in a
non-predictable way.
Generally, using DIO requires tuning shared buffers much more aggressively
than not using DIO, no amount of stream use will change that. Of course we
should try to reduce that "downside"...
I'm not sure if the best way to do prefetching toast chunks would be to rely
on more generalized index->table prefetching support, or to have dedicated code.
-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)
hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.
The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data seqscan: up to 285MB/s
* but TOASTs maxes out at 132MB/s when using io_uring+DIO
I started loading the data to try this out myself :).
Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------
7. Is pg_stat_aios still on the table or not ? (AIO 2021 had it). Any hints
on how to inspect real I/O calls requested to review if the code is issuing
sensible calls: there's no strace for uring, or do you stick to DEBUG3 or
perhaps using some bpftrace / xfsslower is the best way to go ?
I think we still want something like it, but I don't think it needs to be in
the initial commits.
There are kernel events that you can track using e.g. perf. Particularly
useful are
io_uring:io_uring_submit_req
io_uring:io_uring_complete
8. Not sure if that helps, but I've somehow managed to hit the
impossible situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", but I had to push urings really hard into using futexes
and probably I could have made some error in my coding too for that to occur
[3]. As it stands in that patch from my thread, it was not covered: /*
FIXME: fix ret != submitted ?! seems like bug?! */ (but I had hit that
code-path pretty often with a 6.10.x kernel)
I think you can hit that if you don't take care to limit the number of IOs
being submitted at once or if you're not consuming completions. If the
completion queue is full enough the kernel at some point won't allow more IOs
to be submitted.
9. Please let me know, what's the current up to date line of thinking about
this patchset: is it intended to be committed as v18 ?
I'd love to get some of it into 18. I don't quite know whether we can make it
happen and to what extent.
As a debug feature or as non-debug feature? (that is which of the IO methods
should be scrutinized the most as it is going to be the new default - sync
or worker?)
I'd say initially worker, with a beta 1 or 2 checklist item to revise it.
10. At this point, does it even make sense to give an experimental try to
pwritev2() with RWF_ATOMIC? (that thing is already in the open, but XFS is
going to cover it with 6.12.x apparently, but I could try with some -rcX)
I don't think that's worth doing right now. There's too many dependencies and
it's going to be a while till the kernel support for that is widespread enough
to matter.
There's also the issue that, to my knowledge, outside of cloud environments
there's pretty much no hardware that actually reports power-fail atomicity
sizes bigger than a sector.
p.s. I hope I did not ask stupid questions nor missed anything.
You did not!
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.
regards, tom lane
Hi,
On 2024-12-19 17:34:29 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.
I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.
Arguably the configuration *did* tell us, by having a higher hard limit...
I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.
As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.
Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Greetings,
Andres Freund
On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:
Hi,
On 2024-12-20 18:27:13 +0100, Jelte Fennema-Nio wrote:
On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:
Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?
Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:
Good link.
This isn't just relevant for using io_uring:
There obviously are several people working on threaded postgres. Even if we
didn't duplicate fd.c file descriptors between threads (we probably will, at
least initially), the client connection FDs alone will mean that we have a lot
more FDs open. Due to the select() issue the soft limit won't be increased
beyond 1024, and requiring everyone to add a 'ulimit -n $somehighnumber' before
starting postgres on Linux doesn't help anyone.
Greetings,
Andres Freund
Hi,
Attached is a new version of the AIO patchset.
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks who aren't as familiar with AIO; I can't really see
what's easy/hard anymore.
- The read/write patches and the bounce buffer patches are split out, so that
there's no dependency between the first few AIO patches and the "don't dirty
while IO is going on" patchset [1].
- Retries for partial IOs (i.e. short reads) are now implemented. Turned out
to take all of three lines and adding one missing variable initialization.
- I added quite a lot of function-header and file-header comments. There's
more to be done here, but see also the TODO section below.
- IO stats are now tracked. Specifically, the "time" for an IO is now the time
spent waiting for an IO, as discussed around [2]. I haven't updated the
docs yet.
- There now is a fastpath for executing AIO "synchronously", i.e. preparing an
IO and immediately submitting it.
- Previously one needed very large effective_io_concurrency values to get
sufficient asynchronous IO for sequential scans, as read_stream.c limited
max_pinned_buffers to effective_io_concurrency * 4. Unless
effective_io_concurrency was very high, that'd only allow a single IO to be
in-flight, due to io_combine_limit buffers getting merged into one IO.
Instead the pin limit is now capped by effective_io_concurrency *
io_combine_limit.
Right now that's part of one larger "hack up read_stream.c" commit; Thomas
said he'd take a look at how to do this properly. This is probably
something we could and should commit separately.
- io_method = sync has been made more similar to the way IO happens today. In
particular, we now continue to issue prefetch requests and the actual IO is
done only within WaitReadBuffers().
- When using buffered IO with io_uring, there previously was a small
regression, due to more IO happening in the process context with io_uring
(instead of in a kernel thread). While one could argue that it's better to
not increase CPU usage beyond one process, I don't find that sufficiently
convincing. To work around that I added a heuristic that tells io_uring to
execute IOs using its worker infrastructure. That seems to have fixed this
problem entirely.
- IO worker infrastructure was cleaned up
- I pushed a few minor preliminary commits a while ago
- lots of other smaller stuff
The biggest TODOs are:
- Right now the API between bufmgr.c and read_stream.c kind of necessitates
that one StartReadBuffers() call actually can trigger multiple IOs, if
one of the buffers was read in by another backend, before "this" backend
called StartBufferIO().
I think Thomas and I figured out a way to evolve the interface so that this
isn't necessary anymore:
We allow StartReadBuffers() to memorize buffers it pinned but didn't
initiate IO on in the buffers[] argument. The next call to StartReadBuffers
then doesn't have to repin these buffers. That doesn't just solve the
multiple-IOs-for-one-"read operation" issue, it also makes the very common
case of a bunch of "buffer misses" followed by a "buffer hit" cleaner: the
hit wouldn't be tracked in the same ReadBuffersOperation anymore.
- Right now bufmgr.h includes aio.h, because it needs to include a reference
to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
would be noticeable overhead, so that's not an option. I think the best
option here would be to introduce something like aio_types.h, so fewer
things are included.
- There's no obvious way to tell "internal" functions operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.
One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.
The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).
- While I've added a lot of comments, I only got so far adding them. More are
needed.
- The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
improved
- The debug logging functions are a bit of a mess, lots of very similar code
in lots of places. I think AIO needs a few ereport() wrappers to make this
easier.
- More tests are needed. None of our current test frameworks really makes this
easy :(.
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
- I'm not sure that effective_io_concurrency as we have it right now really
makes sense, particularly not with the current default values. But that's a
mostly independent change.
Greetings,
Andres Freund
[1]: /messages/by-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
[2]: /messages/by-id/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl@fvmdr3c6uhat
Attachments:
v2-0010-aio-Implement-smgr-md.c-aio-methods.patch (text/x-diff)
From 45154f1e08ee325875673c14470479f019ef0461 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 15 Dec 2024 12:36:32 -0500
Subject: [PATCH v2 10/20] aio: Implement smgr/md.c aio methods
---
src/include/storage/aio.h | 17 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 +
src/include/storage/smgr.h | 21 ++
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/file/fd.c | 68 ++++++
src/backend/storage/smgr/md.c | 314 ++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 91 ++++++++
8 files changed, 532 insertions(+), 1 deletion(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index a1633a0ed3d..d693b0b0bd8 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -55,9 +55,10 @@ typedef enum PgAioSubjectID
{
/* intentionally the zero value, to help catch zeroed memory etc */
ASI_INVALID = 0,
+ ASI_SMGR,
} PgAioSubjectID;
-#define ASI_COUNT (ASI_INVALID + 1)
+#define ASI_COUNT (ASI_SMGR + 1)
/*
* Flags for an IO that can be set with pgaio_io_set_flag().
@@ -100,6 +101,9 @@ typedef enum PgAioHandleFlags
typedef enum PgAioHandleSharedCallbackID
{
ASC_INVALID,
+
+ ASC_MD_READV,
+ ASC_MD_WRITEV,
} PgAioHandleSharedCallbackID;
@@ -135,6 +139,17 @@ typedef union
typedef union PgAioSubjectData
{
+ struct
+ {
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ int nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 byte for four values */
+ bool is_temp; /* proc can be inferred by owning AIO */
+ bool release_lock;
+ int8 mode;
+ } smgr;
+
/* just as an example placeholder for later */
struct
{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1456ab383a4..e993e1b671f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,10 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
+extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index e7671dd6c18..c3a18465c6b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber old_blocks, BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 63a186bd346..fe23a7f744f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioSubjectInfo;
+
+extern const struct PgAioSubjectInfo aio_smgr_subject_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,6 +124,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -127,4 +142,10 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_subject_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 8694cfafcd1..effb09c11c7 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -20,6 +20,7 @@
#include "storage/aio_internal.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -35,6 +36,7 @@ static const PgAioSubjectInfo *aio_subject_info[] = {
[ASI_INVALID] = &(PgAioSubjectInfo) {
.name = "invalid",
},
+ [ASI_SMGR] = &aio_smgr_subject_info,
};
@@ -46,6 +48,8 @@ typedef struct PgAioHandleSharedCallbacksEntry
static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+ CALLBACK_ENTRY(ASC_MD_READV, aio_md_readv_cb),
+ CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7c403fb360e..eeb6288a9b5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -94,6 +94,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1294,6 +1295,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1987,6 +1990,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2210,6 +2215,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2315,6 +2346,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2498,6 +2557,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2778,6 +2843,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2846,6 +2912,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 11fccda475f..b1277ed97ae 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -132,6 +133,22 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_writev_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+const struct PgAioHandleSharedCallbacks aio_md_readv_cb = {
+ .complete = md_readv_complete,
+ .error = md_readv_error,
+};
+
+const struct PgAioHandleSharedCallbacks aio_md_writev_cb = {
+ .complete = md_writev_complete,
+ .error = md_writev_error,
+};
+
+
static inline int
_mdfd_open_flags(void)
{
@@ -927,6 +944,52 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, AHF_BUFFERED);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1032,6 +1095,52 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, AHF_BUFFERED);
+
+ pgaio_io_set_subject_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1355,6 +1464,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1838,3 +1962,193 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+/*
+ * AIO completion callback for mdstartreadv().
+ */
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_READV;
+ result.error_data = 0;
+
+ md_readv_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = ASC_MD_READV;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartreadv().
+ */
+static void
+md_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * AIO completion callback for mdstartwritev().
+ */
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_WRITEV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_writev_error(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks written a failure */
+ result.status = ARS_ERROR;
+ result.id = ASC_MD_WRITEV;
+ result.error_data = 0;
+
+ md_writev_error(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial writes should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = ASC_MD_WRITEV;
+ }
+
+ if (prior_result.status == ARS_ERROR)
+ {
+ /* AFIXME: complain */
+ return prior_result;
+ }
+
+ prior_result.result /= BLCKSZ;
+
+ return prior_result;
+}
+
+/*
+ * AIO error reporting callback for mdstartwritev().
+ */
+static void
+md_writev_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not write blocks %u..%u in file \"%s\": %m",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum)
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not write blocks %u..%u in file \"%s\": wrote only %zu of %zu bytes",
+ subject_data->smgr.blockNum,
+ subject_data->smgr.blockNum + subject_data->smgr.nblocks - 1,
+ relpathperm(subject_data->smgr.rlocator, subject_data->smgr.forkNum),
+ result.result * (size_t) BLCKSZ,
+ subject_data->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ MemoryContextSwitchTo(oldContext);
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 36ad34aa6ac..454ebe9c243 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber old_blocks, BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,14 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+
+const struct PgAioSubjectInfo aio_smgr_subject_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -623,6 +645,19 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * FILL ME IN
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -657,6 +692,16 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -819,6 +864,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -847,3 +898,43 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_subject_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+
+ pgaio_io_set_subject(ioh, ASI_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioSubjectData *sd = pgaio_io_get_subject_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0011-bufmgr-Implement-AIO-read-support.patch (text/x-diff; charset=us-ascii)
From 7a42b48f7421f071dab6cff273e4cc5b1c3c755f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2 11/20] bufmgr: Implement AIO read support
As of this commit there are no users of these AIO facilities; they'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 4 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 8 +
src/backend/storage/aio/aio_subject.c | 4 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 364 +++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 +++++
7 files changed, 447 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index d693b0b0bd8..ff44dac5bb2 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -104,6 +104,10 @@ typedef enum PgAioHandleSharedCallbackID
ASC_MD_READV,
ASC_MD_WRITEV,
+
+ ASC_SHARED_BUFFER_READ,
+
+ ASC_LOCAL_BUFFER_READ,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index eda6c699212..37520890073 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_ref.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -251,6 +252,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioHandleRef io_in_progress;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -464,4 +467,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..ca8e8b51e68 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,12 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleSharedCallbacks;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +200,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index effb09c11c7..21341aae425 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -50,6 +50,10 @@ static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
CALLBACK_ENTRY(ASC_MD_READV, aio_md_readv_cb),
CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
+
+ CALLBACK_ENTRY(ASC_SHARED_BUFFER_READ, aio_shared_buffer_readv_cb),
+
+ CALLBACK_ENTRY(ASC_LOCAL_BUFFER_READ, aio_local_buffer_readv_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 56761a8eedc..7853b1877e0 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -125,6 +126,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2622221809c..c0fb3028c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5514,6 +5517,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioHandleRef ior;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5521,10 +5525,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ ior = buf->io_in_progress;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_io_ref_valid(&ior))
+ {
+ pgaio_io_ref_wait(&ior);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5613,7 +5626,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5625,6 +5638,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5633,6 +5653,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+ * If we just released a pin, we need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5684,7 +5738,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6143,3 +6197,299 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+/*
+ * Helper to prepare IO on shared buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_in_progress = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by IO.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_readv_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, false);
+}
+
+static PgAioResult
+shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off);
+
+ /*
+ * XXX: It might be better to not set BM_IO_ERROR (which is what
+ * failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_SHARED_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+buffer_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ ProcNumber errProc;
+
+ if (subject_data->smgr.is_temp)
+ errProc = MyProcNumber;
+ else
+ errProc = INVALID_PROC_NUMBER;
+
+ /* AFIXME: need infrastructure to allow memory allocation for error reporting */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ subject_data->smgr.blockNum + result.error_data,
+ relpathbackend(subject_data->smgr.rlocator, errProc, subject_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * Helper to prepare IO on local buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+local_buffer_readv_prepare(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 io_data_len;
+ PgAioHandleRef io_ref;
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ pgaio_io_get_ref(ioh, &io_ref);
+
+ for (int i = 0; i < io_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_in_progress = io_ref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_subject_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = ASC_LOCAL_BUFFER_READ;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
+ .prepare = shared_buffer_readv_prepare,
+ .complete = shared_buffer_readv_complete,
+ .error = buffer_readv_error,
+};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
+ .prepare = local_buffer_readv_prepare,
+ .complete = local_buffer_readv_complete,
+ .error = buffer_readv_error,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6fd1a6418d2..75c4d2570e0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -620,6 +621,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_io_ref_clear(&buf->io_in_progress);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -836,3 +839,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_io_ref_clear(&buf_hdr->io_in_progress);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff; charset=us-ascii)
From e8a5a6318b0e386afb2c1ed2d7f4cc0372358ade Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:55:59 -0400
Subject: [PATCH v2 12/20] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 27 +-
src/backend/storage/buffer/bufmgr.c | 378 ++++++++++++++++++++--------
2 files changed, 300 insertions(+), 105 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ca8e8b51e68..7a12ef6e9be 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_ref.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,10 +108,23 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* IO will immediately be waited for */
+#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+
+/*
+ * FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
+ * need to include that here. Perhaps this could live in a separate header?
+ */
+#include "storage/aio.h"
struct ReadBuffersOperation
{
@@ -131,6 +145,17 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ */
+ int16 nios;
+ PgAioHandleRef refs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +186,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c0fb3028c95..89cb7b41b03 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1235,10 +1235,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+ flags = READ_BUFFERS_SYNCHRONOUSLY;
if (mode == RBM_ZERO_ON_ERROR)
- flags = READ_BUFFERS_ZERO_ON_ERROR;
- else
- flags = 0;
+ flags |= READ_BUFFERS_ZERO_ON_ERROR;
operation.smgr = smgr;
operation.rel = rel;
operation.persistence = persistence;
@@ -1253,6 +1252,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1290,12 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf %d, idx %i: %s, data %p",
+ buffers[i], i, DebugPrintBufferRefcount(buffers[i]),
+ BufferGetBlock(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1324,28 +1332,51 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->flags = flags;
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
+ operation->nios = 0;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ /*
+ * When using AIO, start the IO in the background. If not, issue prefetch
+ * requests if desired by the caller.
+ *
+ * The reason we have a dedicated path for IOMETHOD_SYNC here is to derisk
+ * the introduction of AIO somewhat. It's a large architectural change,
+ * with lots of chances for unanticipated performance effects. Use of
+ * IOMETHOD_SYNC already leads to not actually performing IO
+ * asynchronously, but without the check here we'd execute IO earlier than
+ * we used to.
+ */
+ if (io_method != IOMETHOD_SYNC)
{
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
+ /* initiate the IO asynchronously */
+ return AsyncReadBuffers(operation, io_buffers_len);
}
+ else
+ {
+ operation->flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ {
+ /*
+ * In theory we should only do this if PinBufferForBlock() had to
+ * allocate new buffers above. That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the
+ * advice. That'd be a better simulation of true asynchronous I/O,
+ * which would only start the I/O once, but isn't done here for
+ * simplicity. Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(operation->smgr,
+ operation->forknum,
+ blockNum,
+ operation->io_buffers_len);
+ }
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /* Indicate that WaitReadBuffers() should be called. */
+ return true;
+ }
}
/*
@@ -1397,12 +1428,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * TODO: localbuf.c should use IO_IN_PROGRESS / have an equivalent of
+ * StartBufferIO().
+ */
+ if (pgaio_io_ref_valid(&bufHdr->io_in_progress))
+ {
+ PgAioHandleRef ior = bufHdr->io_in_progress;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_io_ref_wait(&ior);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,13 +1462,38 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
+ IOContext io_context;
+ IOObject io_object;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
- char persistence;
+ bool have_retryable_failure;
+
+ /*
+ * If we get here without any IO operations having been issued, the
+ * io_method == IOMETHOD_SYNC path must have been used. In that case, we
+ * start the IO now - as we used to - just before waiting.
+ */
+ if (operation->nios == 0)
+ {
+ Assert(io_method == IOMETHOD_SYNC);
+ if (!AsyncReadBuffers(operation, operation->io_buffers_len))
+ {
+ /* all blocks were already read in concurrently */
+ return;
+ }
+ }
+
+ if (operation->persistence == RELPERSISTENCE_TEMP)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(operation->strategy);
+ io_object = IOOBJECT_RELATION;
+ }
+
+restart:
/*
* Currently operations are only allowed to include a read of some range,
@@ -1433,15 +1508,101 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
- buffers = &operation->buffers[0];
- blocknum = operation->blocknum;
- forknum = operation->forknum;
- persistence = operation->persistence;
+ Assert(operation->nios > 0);
+ /*
+ * For IO timing we just count the time spent waiting for the IO.
+ *
+ * XXX: We probably should track the IO operation, rather than its time,
+ * separately, when initiating the IO. But right now that's not quite
+ * allowed by the interface.
+ */
+ have_retryable_failure = false;
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret = &operation->returns[i];
+
+ /*
+ * Tracking a wait even if we don't actually need to wait a) is not
+ * cheap and b) reports time as waiting even though we never waited.
+ */
+ if (aio_ret->result.status == ARS_UNKNOWN &&
+ !pgaio_io_ref_check_done(&operation->refs[i]))
+ {
+ instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+
+ pgaio_io_ref_wait(&operation->refs[i]);
+
+ /*
+ * The IO operation itself was already counted earlier, in
+ * AsyncReadBuffers().
+ */
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+ 0);
+ }
+ else
+ {
+ Assert(pgaio_io_ref_check_done(&operation->refs[i]));
+ }
+
+ if (aio_ret->result.status == ARS_PARTIAL)
+ {
+ /*
+ * We'll retry below, so we just emit a debug message to the server
+ * log (or not even that in prod scenarios).
+ */
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, DEBUG1);
+ have_retryable_failure = true;
+ }
+ else if (aio_ret->result.status != ARS_OK)
+ pgaio_result_log(aio_ret->result, &aio_ret->subject_data, ERROR);
+ }
+
+ /*
+ * If any of the associated IOs failed, try again to issue IOs. Buffers
+ * for which IO has completed successfully will be discovered as such and
+ * not retried.
+ */
+ if (have_retryable_failure)
+ {
+ nblocks = operation->io_buffers_len;
+
+ elog(DEBUG3, "retrying IO after partial failure");
+ CHECK_FOR_INTERRUPTS();
+ AsyncReadBuffers(operation, nblocks);
+ goto restart;
+ }
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks)
+{
+ int io_buffers_len = 0;
+ Buffer *buffers = &operation->buffers[0];
+ int flags = operation->flags;
+ BlockNumber blocknum = operation->blocknum;
+ ForkNumber forknum = operation->forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+ uint32 ioh_flags = 0;
+
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
+ ioh_flags |= AHF_REFERENCES_LOCAL;
}
else
{
@@ -1449,6 +1610,16 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_object = IOOBJECT_RELATION;
}
+ /*
+ * When this IO is executed synchronously, either because the caller will
+ * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
+ * the AIO subsystem needs to know.
+ */
+ if (flags & READ_BUFFERS_SYNCHRONOUSLY)
+ ioh_flags |= AHF_SYNCHRONOUS;
+
+ operation->nios = 0;
+
/*
* We count all these blocks as read by this backend. This is traditional
* behavior, but might turn out to be not true if we find that someone
@@ -1464,19 +1635,38 @@ WaitReadBuffers(ReadBuffersOperation *operation)
for (int i = 0; i < nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
/*
- * Skip this block if someone else has already completed it. If an
- * I/O is already in progress in another backend, this will wait for
- * the outcome: either done, or something went wrong and we will
- * retry.
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_get() might block,
+ * which we don't want after setting IO_IN_PROGRESS.
+ *
+ * XXX: Should we attribute the time spent in here to the IO? If there
+ * already are a lot of IO operations in progress, getting an IO
+ * handle will block waiting for some other IO operation to finish.
+ *
+ * In most cases it'll be free to get the IO, so a timer would be
+ * overhead. Perhaps we should use pgaio_io_get_nb() and only account
+ * IO time when pgaio_io_get_nb() returned false?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (likely(!ioh))
+ ioh = pgaio_io_get(CurrentResourceOwner, &operation->returns[operation->nios]);
+
+ /*
+ * Skip this block if someone else has already completed it.
+ *
+ * If an I/O is already in progress in another backend, this will wait
+ * for the outcome: either done, or something went wrong and we will
+ * retry. But don't wait if we have staged, but haven't issued,
+ * another IO.
+ *
+	 * XXX: If we can't start IO due to unsubmitted IO, it might be worth
+	 * submitting and then trying to start IO again.
+ */
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1678,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u: %s",
+ buffers[i], DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1692,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG5,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into other
* buffers at the same time? In this case we don't wait if we see an
@@ -1505,85 +1705,57 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* We'll come back to this block again, above.
*/
while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG5,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- io_buffers_len);
+ pgaio_io_get_ref(ioh, &operation->refs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
- {
- BufferDesc *bufHdr;
- Block bufBlock;
+ pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
+ else
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ pgaio_io_set_flag(ioh, ioh_flags);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
- }
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op_n(io_object, io_context, IOOP_READ, io_buffers_len);
+ }
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
}
+ else
+ return false;
}
/*
@@ -6367,7 +6539,7 @@ shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
prior_result.status == ARS_ERROR
|| prior_result.result <= io_data_off;
- elog(DEBUG3, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
+ elog(DEBUG5, "calling rbcrs for buf %d with failed %d, error: %d, result: %d, data_off: %d",
buf, failed, prior_result.status, prior_result.result, io_data_off);
/*
--
2.45.2.746.g06e570c0df.dirty
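The control flow of the rewritten read path in the patch above — lazily acquiring an IO handle before ReadBuffersCanStartIO(), staging each (possibly combined) read, and submitting the whole batch once at the end — can be modeled with a small self-contained sketch. The `Sim*` state and function names are invented for illustration; they are not the real PostgreSQL AIO API:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
	int			staged;			/* staged but not yet submitted IOs */
	int			submitted;		/* IOs handed off for execution */
} SimAioState;

static void
sim_submit_staged(SimAioState *s)
{
	s->submitted += s->staged;
	s->staged = 0;
}

/*
 * Mirrors the shape of the patched read path: each loop iteration stages
 * one read (standing in for smgrstartreadv()); nothing is submitted until
 * the end, and even then only if the caller doesn't promise to issue more
 * IO (the READ_BUFFERS_MORE_MORE_MORE case from the following patch).
 */
static bool
sim_read_buffers(SimAioState *s, int nios, bool caller_will_issue_more)
{
	bool		did_start_io = false;

	for (int i = 0; i < nios; i++)
	{
		s->staged++;
		did_start_io = true;
	}

	if (did_start_io && !caller_will_issue_more)
		sim_submit_staged(s);

	return did_start_io;
}
```

The point of the batching is that several combined reads issued by one call end up in a single submission syscall (or worker wakeup), rather than one per read.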
Attachment: v2-0013-aio-Very-WIP-read_stream.c-adjustments-for-real-A.patch (text/x-diff)
From b7123290712da81631ecfbfb2437b95eb42a8e9c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2 13/20] aio: Very-WIP: read_stream.c adjustments for real
AIO
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 31 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7a12ef6e9be..2a836cf98c6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -119,6 +119,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
/* IO will immediately be waited for */
#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)
/*
* FIXME: PgAioReturn is defined in aio.h. It'd be much better if we didn't
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 3d30e6224f7..5b5bae16c44 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "postgres.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -240,14 +241,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -306,6 +311,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -355,6 +368,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -379,6 +393,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -442,7 +458,7 @@ read_stream_begin_impl(int flags,
* overflow (even though that's not possible with the current GUC range
* limits), allowing also for the spare entry and the overflow space.
*/
- max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+ max_pinned_buffers = Max(max_ios * io_combine_limit, io_combine_limit);
max_pinned_buffers = Min(max_pinned_buffers,
PG_INT16_MAX - io_combine_limit - 1);
@@ -493,10 +509,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -727,7 +744,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 89cb7b41b03..722e73eb7d0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1751,7 +1751,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.45.2.746.g06e570c0df.dirty
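The throttle added at the top of read_stream_look_ahead() in the patch above can be isolated as a pure predicate, which makes the heuristic easier to see. This is a sketch; the function name and parameter list are invented here, only the arithmetic comes from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Once the stream's distance exceeds 8x the combine limit, stop queueing
 * more IO while pinned + pending buffers already cover more than 3/4 of
 * the distance.  This keeps a deep stream from eagerly re-filling on
 * every consumed buffer.
 */
static bool
lookahead_should_pause(int distance, int pinned_buffers,
					   int pending_read_nblocks, int io_combine_limit)
{
	if (distance > io_combine_limit * 8 &&
		pinned_buffers + pending_read_nblocks > (distance * 3) / 4)
		return true;
	return false;
}
```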
Attachment: v2-0014-aio-Add-bounce-buffers.patch (text/x-diff)
From c1a5b7c868eb962a3e1e5348aa6309aa1005f4eb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 25 Nov 2024 16:35:15 -0500
Subject: [PATCH v2 14/20] aio: Add bounce buffers
---
src/include/storage/aio.h | 18 ++
src/include/storage/aio_internal.h | 33 ++++
src/include/utils/resowner.h | 2 +
src/backend/storage/aio/README.md | 27 +++
src/backend/storage/aio/aio.c | 182 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 118 ++++++++++++
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/backend/utils/resowner/resowner.c | 25 ++-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 419 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ff44dac5bb2..1bef475b0a9 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -222,6 +222,9 @@ typedef struct PgAioHandleSharedCallbacks
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
/*
* How many callbacks can be registered for one IO handle. Currently we only
* need two, but it's not hard to imagine needing a few more.
@@ -294,6 +297,20 @@ extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error);
+
+
+
/* --------------------------------------------------------------------------------
* Actions on multiple IOs.
* --------------------------------------------------------------------------------
@@ -354,6 +371,7 @@ typedef enum IoMethod
extern const struct config_enum_entry io_method_options[];
extern int io_method;
extern int io_max_concurrency;
+extern int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index d2dc1516bdf..2065bde79c3 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -91,6 +91,12 @@ struct PgAioHandle
/* index into PgAioCtl->iovecs */
uint32 iovec_off;
+ /*
+ * List of bounce_buffers owned by IO. It would suffice to use an index
+ * based linked list here.
+ */
+ slist_head bounce_buffers;
+
/**
* In which list the handle is registered, depends on the state:
* - IDLE, in per-backend list
@@ -130,11 +136,23 @@ struct PgAioHandle
};
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
typedef struct PgAioPerBackend
{
/* index into PgAioCtl->io_handles */
uint32 io_handle_off;
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
/* IO Handles that currently are not used */
dclist_head idle_ios;
@@ -162,6 +180,12 @@ typedef struct PgAioPerBackend
* IOs being appended at the end.
*/
dclist_head in_flight_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
} PgAioPerBackend;
@@ -187,6 +211,15 @@ typedef struct PgAioCtl
*/
uint64 *iovecs_data;
+ /*
+	 * To perform AIO on buffers that cannot be accessed directly (either
+	 * because they are not in shared memory or because we need to operate on
+	 * a copy, as is e.g. the case for writes when checksums are in use)
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
uint64 io_handle_count;
PgAioHandle *io_handles;
} PgAioCtl;
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 2d55720a54c..0cdd0c13ffb 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -168,5 +168,7 @@ extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *local
struct dlist_node;
extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
#endif /* RESOWNER_H */
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 893f4ffe428..0076ea4aa10 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -395,6 +395,33 @@ shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an AIO. E.g. when data checksums are enabled, writes
+currently cannot be done directly from shared buffers, as a shared buffer
+lock still allows some modifications, e.g., for hint bits (see
+`FlushBuffer()`). If the write were done in place, such modifications could
+cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer, many buffers are required, as many IOs might
+  be in flight.
+- When using the [worker method](#worker), the source/target of IO needs to be
+ in shared memory, otherwise the workers won't be able to access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target for IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO Handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
## Helpers
Using the low-level AIO API introduces too much complexity to do so all over
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 2439ce3740d..e829e1752ca 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -54,6 +54,8 @@ static void pgaio_io_resowner_register(PgAioHandle *ioh);
static void pgaio_io_wait_for_free(void);
static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
@@ -68,6 +70,7 @@ const struct config_enum_entry io_method_options[] = {
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
/* global control for AIO */
@@ -732,6 +735,21 @@ pgaio_io_reclaim(PgAioHandle *ioh)
}
}
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ }
+ }
+
if (ioh->resowner)
{
ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
@@ -855,6 +873,168 @@ pgaio_io_wait_for_free(void)
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (my_aio->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME It probably is not correct to have bounce buffers be per backend,
+ * they use too much memory.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&my_aio->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ my_aio->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ my_aio->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - aio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (my_aio->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&my_aio->idle_bbs, &bb->node);
+ my_aio->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free bb");
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = my_aio->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &aio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ continue;
+ case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d to reclaim BB",
+ pgaio_io_get_id(ioh));
+
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ case AHS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&my_aio->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&my_aio->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
/* --------------------------------------------------------------------------------
* Actions on multiple IOs.
* --------------------------------------------------------------------------------
@@ -929,6 +1109,7 @@ void
pgaio_at_xact_end(bool is_subxact, bool is_commit)
{
Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
}
/*
@@ -939,6 +1120,7 @@ void
pgaio_at_error(void)
{
Assert(!my_aio->handed_out_io);
+ Assert(!my_aio->handed_out_bb);
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 23adc5308e5..417526f3823 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -82,6 +82,32 @@ AioIOVDataShmemSize(void)
io_max_concurrency));
}
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
/*
* Choose a suitable value for io_max_concurrency.
*
@@ -107,6 +133,33 @@ AioChooseMaxConccurrency(void)
return Min(max_proportional_pins, 64);
}
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * currently are used for writes, and it seems very uncommon for more than
+ * 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
@@ -130,11 +183,31 @@ AioShmemSize(void)
PGC_S_OVERRIDE);
}
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
sz = add_size(sz, AioCtlShmemSize());
sz = add_size(sz, AioBackendShmemSize());
sz = add_size(sz, AioHandleShmemSize());
sz = add_size(sz, AioIOVShmemSize());
sz = add_size(sz, AioIOVDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
if (pgaio_impl->shmem_size)
sz = add_size(sz, pgaio_impl->shmem_size());
@@ -148,7 +221,10 @@ AioShmemInit(void)
bool found;
uint32 io_handle_off = 0;
uint32 iovec_off = 0;
+ uint32 bounce_buffers_off = 0;
uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
aio_ctl = (PgAioCtl *)
ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
@@ -160,6 +236,7 @@ AioShmemInit(void)
aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ aio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
aio_ctl->backend_state = (PgAioPerBackend *)
ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
@@ -170,6 +247,35 @@ AioShmemInit(void)
aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+ aio_ctl->bounce_buffers = ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(), &found);
+
+ bounce_buffers_data = ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(), &found);
+ bounce_buffers_data = (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ aio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < aio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->subject = ASI_INVALID;
+ ioh->state = AHS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < aio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
for (int procno = 0; procno < AioProcs(); procno++)
{
PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
@@ -177,9 +283,13 @@ AioShmemInit(void)
bs->io_handle_off = io_handle_off;
io_handle_off += io_max_concurrency;
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
dclist_init(&bs->in_flight_ios);
+ slist_init(&bs->idle_bbs);
/* initialize per-backend IOs */
for (int i = 0; i < io_max_concurrency; i++)
@@ -201,6 +311,14 @@ AioShmemInit(void)
dclist_push_tail(&bs->idle_ios, &ioh->node);
iovec_off += io_combine_limit;
}
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &aio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
}
out:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b2999b86c24..39e91ebd2a5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3233,6 +3233,19 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO Bounce Buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"io_workers",
PGC_SIGHUP,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5893eb29228..da6e248a29e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -848,6 +848,8 @@
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
# (change requires restart)
+#io_bounce_buffers = -1 # -1 sets based on shared_buffers
+ # (change requires restart)
#------------------------------------------------------------------------------
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 5cf14472ebd..d1932b7393c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -159,10 +159,11 @@ struct ResourceOwnerData
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
/*
- * AIO handles need be registered in critical sections and therefore
- * cannot use the normal ResoureElem mechanism.
+	 * AIO handles & bounce buffers need to be registered in critical
+	 * sections and therefore cannot use the normal ResourceElem mechanism.
*/
dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -434,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
return owner;
}
@@ -743,6 +745,13 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
pgaio_io_release_resowner(node, !isCommit);
}
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1112,3 +1121,15 @@ ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
{
dlist_delete_from(&owner->aio_handles, ioh_node);
}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a5b12b48f99..dc52d6165d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2104,6 +2104,7 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBounceBuffer
PgAioCtl
PgAioHandle
PgAioHandleFlags
--
2.45.2.746.g06e570c0df.dirty
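The bounce-buffer ownership rules from the patch above — only one buffer handed out per backend at a time, ownership moving to the IO on association, and buffers returning to the idle list when the IO is reclaimed — can be sketched with a toy free list. The `SimBB`/`sim_bb_*` names are made up; the real code additionally waits (and may submit staged IO) instead of returning NULL when no buffer is idle:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BB 4

typedef struct SimBB
{
	struct SimBB *next;
	bool		in_io;			/* currently owned by an in-flight IO */
} SimBB;

static SimBB bbs[NUM_BB];
static SimBB *idle_head;
static SimBB *handed_out;

static void
sim_bb_init(void)
{
	idle_head = NULL;
	handed_out = NULL;
	for (int i = 0; i < NUM_BB; i++)
	{
		bbs[i].in_io = false;
		bbs[i].next = idle_head;
		idle_head = &bbs[i];
	}
}

/*
 * Like pgaio_bounce_buffer_get(): at most one buffer may be handed out at
 * a time.  Returns NULL when none is available; the real code waits for an
 * in-flight IO to complete instead.
 */
static SimBB *
sim_bb_get(void)
{
	SimBB	   *bb;

	if (handed_out != NULL || idle_head == NULL)
		return NULL;
	bb = idle_head;
	idle_head = bb->next;
	handed_out = bb;
	return bb;
}

/* Like pgaio_io_assoc_bounce_buffer(): ownership moves to the IO. */
static void
sim_bb_assoc_with_io(SimBB *bb)
{
	assert(handed_out == bb);
	handed_out = NULL;
	bb->in_io = true;
}

/* Like pgaio_io_reclaim() pushing the IO's buffers back onto idle_bbs. */
static void
sim_bb_reclaim(SimBB *bb)
{
	bb->in_io = false;
	bb->next = idle_head;
	idle_head = bb;
}
```

Because association transfers ownership to the IO, the resource owner forgets the buffer at that point, matching the `ResourceOwnerForgetAioBounceBuffer()` call in `pgaio_io_assoc_bounce_buffer()`.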
Attachment: v2-0015-bufmgr-Implement-AIO-write-support.patch (text/x-diff)
From 40e15609a95f6733a7fe0e202c5ec4add3044bad Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:01 -0400
Subject: [PATCH v2 15/20] bufmgr: Implement AIO write support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 2 +
src/include/storage/bufmgr.h | 2 +
src/backend/storage/aio/aio_subject.c | 2 +
src/backend/storage/buffer/bufmgr.c | 85 +++++++++++++++++++++++++++
4 files changed, 91 insertions(+)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 1bef475b0a9..caa52d2aaba 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -106,8 +106,10 @@ typedef enum PgAioHandleSharedCallbackID
ASC_MD_WRITEV,
ASC_SHARED_BUFFER_READ,
+ ASC_SHARED_BUFFER_WRITE,
ASC_LOCAL_BUFFER_READ,
+ ASC_LOCAL_BUFFER_WRITE,
} PgAioHandleSharedCallbackID;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2a836cf98c6..2e88b19619c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,7 +205,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
struct PgAioHandleSharedCallbacks;
extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb;
extern const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb;
+extern const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb;
/* upper limit for effective_io_concurrency */
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index 21341aae425..b2bd0c235e7 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -52,8 +52,10 @@ static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
CALLBACK_ENTRY(ASC_MD_WRITEV, aio_md_writev_cb),
CALLBACK_ENTRY(ASC_SHARED_BUFFER_READ, aio_shared_buffer_readv_cb),
+ CALLBACK_ENTRY(ASC_SHARED_BUFFER_WRITE, aio_shared_buffer_writev_cb),
CALLBACK_ENTRY(ASC_LOCAL_BUFFER_READ, aio_local_buffer_readv_cb),
+ CALLBACK_ENTRY(ASC_LOCAL_BUFFER_WRITE, aio_local_buffer_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 722e73eb7d0..0f94db19f9d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6437,6 +6437,44 @@ ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
return buf_failed;
}
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ /* AFIXME: implement track_io_timing */
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of the IO is no longer managing the lock (it called
+ * LWLockDisown()); we are.
+ */
+ if (release_lock)
+ LWLockReleaseUnowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
/*
* Helper to prepare IO on shared buffers for execution, shared between reads
* and writes.
@@ -6518,6 +6556,12 @@ shared_buffer_readv_prepare(PgAioHandle *ioh)
shared_buffer_prepare_common(ioh, false);
}
+static void
+shared_buffer_writev_prepare(PgAioHandle *ioh)
+{
+ shared_buffer_prepare_common(ioh, true);
+}
+
static PgAioResult
shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
{
@@ -6586,6 +6630,34 @@ buffer_readv_error(PgAioResult result, const PgAioSubjectData *subject_data, int
MemoryContextSwitchTo(oldContext);
}
+static PgAioResult
+shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 io_data_len;
+
+ elog(DEBUG3, "%s: %d %d", __func__, prior_result.status, prior_result.result);
+
+ io_data = pgaio_io_get_io_data(ioh, &io_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < io_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->scb_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
/*
* Helper to prepare IO on local buffers for execution, shared between reads
* and writes.
@@ -6655,14 +6727,27 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
+static void
+local_buffer_writev_prepare(PgAioHandle *ioh)
+{
+ elog(ERROR, "not yet");
+}
+
const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
.prepare = shared_buffer_readv_prepare,
.complete = shared_buffer_readv_complete,
.error = buffer_readv_error,
};
+const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb = {
+ .prepare = shared_buffer_writev_prepare,
+ .complete = shared_buffer_writev_complete,
+};
const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
.prepare = local_buffer_readv_prepare,
.complete = local_buffer_readv_complete,
.error = buffer_readv_error,
};
+const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb = {
+ .prepare = local_buffer_writev_prepare,
+};
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0016-aio-Add-IO-queue-helper.patch (text/x-diff; charset=us-ascii)
From 0d7dbde438633fbb7af0dd2f3efd3a2c6b587438 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:42 -0400
Subject: [PATCH v2 16/20] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 33 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 195 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 232 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..28077158d6d
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+#include "storage/bufmgr.h"
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioHandleRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_get_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3bcb8a0b2ed..f3a7f9e63d6 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..89ccfc2b9a7
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * AIO - Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/io_queue.h"
+
+#include "storage/aio.h"
+
+
+typedef struct TrackedIO
+{
+ PgAioHandleRef ior;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_io_ref_clear(&tio->ior);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_io_ref_wait(&tio->ior);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_get_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_get_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioHandleRef *ior)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->ior = *ior;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_io_ref_check_done(&tio->ior))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_io_ref_get_id(&tio->ior)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_io_ref_wait(&tio->ior);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 537f23d446d..e8a88e615c0 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dc52d6165d4..ca1e3427bc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1175,6 +1175,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2974,6 +2975,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0017-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff; charset=us-ascii)
From ffe8489a8b44bc0a0b11ad765d578aa12801925a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Jul 2024 10:01:23 -0700
Subject: [PATCH v2 17/20] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be based on top of
work by Thomas Munro instead of the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 25 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 581 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 580 insertions(+), 58 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 407f26e5302..01a936fbc0a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 37520890073..9d3123663b3 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,8 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
+#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2e88b19619c..455bbbcbfc4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -327,7 +327,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..6f8fe796da3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..71c08da45db 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -167,6 +171,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
* about in bgwriter, but we do have LWLocks, buffers, and temp files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
@@ -226,12 +231,27 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * XXX: Before exiting, wait for all IO to finish. That's only
+ * important to avoid spurious PrintBufferLeakWarning() /
+ * PrintAioIPLeakWarning() calls, triggered by
+ * ReleaseAuxProcessResources() being called with isCommit=true.
+ *
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts() remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +268,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 982572a75db..0c08acd611f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,9 +46,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
@@ -266,6 +268,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* files.
*/
LWLockReleaseAll();
+ pgaio_at_error();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
UnlockBuffers();
@@ -719,7 +722,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -752,6 +755,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f94db19f9d..863464f12da 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -3067,6 +3068,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3098,7 +3149,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3160,7 +3214,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3268,48 +3324,91 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
+ bool batch_continue = true;
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (batch_continue &&
+ to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since PrepareToWriteBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * PrepareToWriteBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, PrepareToWriteBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ batch_continue = false;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+ * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3327,15 +3426,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3361,7 +3468,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3404,6 +3511,9 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+ int max_combine;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3424,6 +3534,8 @@ BgBufferSync(WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3580,11 +3692,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3595,6 +3721,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == max_combine)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3606,6 +3739,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3644,8 +3782,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+ * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3654,22 +3850,56 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
int result = 0;
- uint32 buf_state;
- BufferTag tag;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ if (to_write->ioh == NULL)
+ {
+ to_write->ioh = io_queue_get_io(ioq);
+ pgaio_io_get_ref(to_write->ioh, &to_write->ior);
+ }
+
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3679,7 +3909,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3688,40 +3918,282 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
+
/*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
+ * If we are merging, check whether the buffer's identity changed while
+ * we had not yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
+
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0 &&
+ !pgaio_have_staged() &&
+ io_queue_is_empty(ioq);
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+ * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0 &&
+ !pgaio_have_staged() &&
+ io_queue_is_empty(ioq);
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %u: can't block, nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
+
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %u: non-waitable StartBufferIO returns false, may_block = %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
- tag = bufHdr->tag;
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
- UnpinBuffer(bufHdr);
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_io_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_add_shared_cb(to_write->ioh, ASC_SHARED_BUFFER_WRITE);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->ior);
+ to_write->total_writes++;
- return result | BUF_WRITTEN;
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_io_ref_clear(&to_write->ior);
}
/*
@@ -4087,6 +4559,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aa264f61b9c..1f6b982c7e9 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1480,6 +1480,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* A copy for checksumming is only needed if data checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ca1e3427bc1..cdfef5698e7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -345,6 +345,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0018-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From bdc7ed519ced00b6cc7fd7eb8137d5d79d846353 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:13:48 -0400
Subject: [PATCH v2 18/20] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 10 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 38 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 295 ++++++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 +++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 84 +++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 99 ++++
src/test/modules/test_aio/test_aio.c | 504 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1468 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 2065bde79c3..f4c57438dd4 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -265,6 +265,16 @@ extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9d3123663b3..1b3329a25b4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e829e1752ca..261a752fb80 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -46,6 +46,9 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
@@ -92,6 +95,11 @@ static const IoMethodOps *pgaio_ops_table[] = {
const IoMethodOps *pgaio_impl;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* "Core" IO Api
@@ -631,6 +639,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
pgaio_io_update_state(ioh, AHS_REAPED);
+#ifdef USE_INJECTION_POINTS
+ inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ inj_cur_handle = NULL;
+#endif
+
pgaio_io_process_completion_subject(ioh);
pgaio_io_update_state(ioh, AHS_COMPLETED_SHARED);
@@ -1129,3 +1150,20 @@ assign_io_method(int newval, void *extra)
{
pgaio_impl = pgaio_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 863464f12da..4a022440ada 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6213,7 +6212,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c0d3cf0e14b..73ff9c55687 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c829b619530..61c854a9b63 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: with meson this runs the tests once with worker and once - if
+# supported - with io_uring.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e62e3718845
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,295 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192 + 4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(0);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..1190531f5ad
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,84 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192 + 4096);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(4096);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(0);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+-----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce buffer handles
+-----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..e3d5ce29c60
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,99 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION errno_from_string(sym text)
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..e495c5309b3
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ * Test module for AIO, exercising AIO handles, bounce buffers and
+ * IO error paths from SQL.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/rel.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize the shared state.
+ */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(errno_from_string);
+Datum
+errno_from_string(PG_FUNCTION_ARGS)
+{
+ const char *sym = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+ if (strcmp(sym, "EIO") == 0)
+ PG_RETURN_INT32(EIO);
+ else if (strcmp(sym, "EAGAIN") == 0)
+ PG_RETURN_INT32(EAGAIN);
+ else if (strcmp(sym, "EINTR") == 0)
+ PG_RETURN_INT32(EINTR);
+ else if (strcmp(sym, "ENOSPC") == 0)
+ PG_RETURN_INT32(ENOSPC);
+ else if (strcmp(sym, "EROFS") == 0)
+ PG_RETURN_INT32(EROFS);
+
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg_internal("%s is not a supported errno value", sym));
+ PG_RETURN_INT32(0);
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioHandleRef ior;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get_ref(ioh, &ior);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if this is just a test, we should verify nobody else uses this buffer */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_io_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_io_ref_wait(&ior);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_get(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_get(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_get(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0001-Ensure-a-resowner-exists-for-all-paths-that-may-p.patch (text/x-diff)
From 42af1a44eadbfc3ac4e65ab23d280d6933b23284 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 8 Oct 2024 14:34:38 -0400
Subject: [PATCH v2 01/20] Ensure a resowner exists for all paths that may
perform AIO
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 6 +++++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e0cb70ee9da..8ddcab0182a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4dc14fdb495..76fce6749a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01c4016ced6..8a09c939eff 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -755,8 +755,12 @@ InitPostgres(const char *in_dbname, Oid dboid,
* We don't yet have an aux-process resource owner, but StartupXLOG
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
+ *
+ * In bootstrap mode CreateAuxProcessResourceOwner() was already
+ * called in BootstrapModeMain().
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0002-Allow-lwlocks-to-be-unowned.patch (text/x-diff)
From 5eff74f7f0bd0cf7102a04263a0dc9c0439123ed Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2 02/20] Allow lwlocks to be unowned
This is required for AIO, so that a lock held during a write can be released
by another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 110 ++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 30 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..eabf813ce05 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern LWLockMode LWLockDisown(LWLock *l);
+extern void LWLockReleaseUnowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 9cf3e4f4f3a..bc459dc5d2b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,52 +1773,36 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
-/*
- * LWLockRelease - release a previously acquired lock
- */
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
+ oldstate = pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
/* nobody else can have that kind of lock */
- Assert(!(oldstate & LW_VAL_EXCLUSIVE));
+ if (mode == LW_EXCLUSIVE)
+ Assert((oldstate & LW_LOCK_MASK) == LW_VAL_EXCLUSIVE);
+ else
+ Assert((oldstate & LW_LOCK_MASK) < LW_VAL_EXCLUSIVE &&
+ (oldstate & LW_LOCK_MASK) >= LW_VAL_SHARED);
if (TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED())
TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+ if (mode == LW_EXCLUSIVE)
+ oldstate -= LW_VAL_EXCLUSIVE;
+ else
+ oldstate -= LW_VAL_SHARED;
+
/*
* We're still waiting for backends to get scheduled, don't wake them up
* again.
@@ -1841,6 +1825,72 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
+}
+
+void
+LWLockReleaseUnowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it is the caller's responsibility to ensure
+ * that the lock gets released, even in case of an error. This is only
+ * desirable if the lock is going to be released by a different process than
+ * the one that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */
+LWLockMode
+LWLockDisown(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisown(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
+
/*
* Now okay to allow cancel/die interrupts.
*/
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0003-aio-Basic-subsystem-initialization.patch (text/x-diff)
From 93547a5a5b72fa0689b812ee6336b74c74eb95d7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2 03/20] aio: Basic subsystem initialization
This is just separate to make it easier to review the tendrils into various
places.
---
src/include/storage/aio.h | 42 +++++++++++++++++++
src/include/storage/aio_init.h | 24 +++++++++++
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 33 +++++++++++++++
src/backend/storage/aio/aio_init.c | 41 ++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/utils/init/postinit.c | 7 ++++
src/backend/utils/misc/guc_tables.c | 23 ++++++++++
src/backend/utils/misc/postgresql.conf.sample | 11 +++++
src/tools/pgindent/typedefs.list | 1 +
11 files changed, 189 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..0ee9d0043de
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+#include "utils/guc_tables.h"
+
+
+/* GUC related */
+extern void assign_io_method(int newval, void *extra);
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+/* GUCs */
+extern const struct config_enum_entry io_method_options[];
+extern int io_method;
+extern int io_max_concurrency;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..1c1d62baa79
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_init_backend(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..72110c0df3e
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * AIO - Core Logic
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..84e0e37baae
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * AIO - Subsystem Initialization
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_init_backend(void)
+{
+}
+
+void
+pgaio_postmaster_child_init_local(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 10e1aa3b20b..8d20759ebf8 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..c7703e5178e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -37,6 +37,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 8a09c939eff..9d1025e815b 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -626,6 +627,12 @@ BaseInit(void)
*/
pgstat_initialize();
+ /*
+ * Initialize AIO before infrastructure that might need to actually
+ * execute AIO.
+ */
+ pgaio_init_backend();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8cf1afbad20..6d4056c68b9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -3219,6 +3220,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
@@ -5226,6 +5239,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a2ac7575ca7..c4c60da9845 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -838,6 +838,17 @@
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..2586d1cf53f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1262,6 +1262,7 @@ IntoClause
InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0004-aio-Core-AIO-implementation.patch (text/x-diff; charset=us-ascii)
From b64c247210c5a5067b5c76f6ab68c978606b0902 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 9 Dec 2024 14:14:13 -0500
Subject: [PATCH v2 04/20] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 296 ++++++
src/include/storage/aio_internal.h | 244 +++++
src/include/storage/aio_ref.h | 24 +
src/include/utils/resowner.h | 5 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 3 +
src/backend/storage/aio/aio.c | 906 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 186 +++-
src/backend/storage/aio/aio_io.c | 140 +++
src/backend/storage/aio/aio_subject.c | 231 +++++
src/backend/storage/aio/meson.build | 3 +
src/backend/storage/aio/method_sync.c | 45 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/resowner/resowner.c | 30 +
src/tools/pgindent/typedefs.list | 18 +
15 files changed, 2139 insertions(+), 4 deletions(-)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_ref.h
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_subject.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 0ee9d0043de..b386dabc921 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -15,9 +15,305 @@
#define AIO_H
+#include "storage/aio_ref.h"
+#include "storage/procnumber.h"
#include "utils/guc_tables.h"
+typedef struct PgAioHandle PgAioHandle;
+
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READV,
+ PGAIO_OP_WRITEV,
+
+ /**
+ * In the near term we'll need at least:
+ * - fsync / fdatasync
+ * - flush_range
+ *
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ **/
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_WRITEV + 1)
+
+
+/*
+ * On what is IO being performed.
+ *
+ * PgAioSharedCallback specific behaviour should be implemented in
+ * aio_subject.c.
+ */
+typedef enum PgAioSubjectID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ ASI_INVALID = 0,
+} PgAioSubjectID;
+
+#define ASI_COUNT (ASI_INVALID + 1)
+
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ /* hint that IO will be executed synchronously */
+ AHF_SYNCHRONOUS = 1 << 0,
+
+ /* the IO references backend local memory */
+ AHF_REFERENCES_LOCAL = 1 << 1,
+
+ /*
+ * The IO is using buffered IO; used to control heuristics in some IO
+ * methods. Advantageous to set, if applicable, but not required for
+ * correctness.
+ */
+ AHF_BUFFERED = 1 << 2,
+} PgAioHandleFlags;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ *    structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND, function pointers are not necessarily stable between
+ *    different backends; therefore function pointers cannot directly be stored
+ *    in shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling an
+ * ID->pointer mapping table on demand. In the presence of 2), that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleSharedCallbackID
+{
+ ASC_INVALID,
+} PgAioHandleSharedCallbackID;
+
+
+/*
+ * Data necessary for basic IO types (PgAioOp).
+ *
+ * NB: Note that the FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued,
+ * but only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+} PgAioOpData;
+
+
+/* XXX: Perhaps it's worth moving this to a dedicated file? */
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+typedef union PgAioSubjectData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioSubjectData;
+
+
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN, /* not yet completed / uninitialized */
+ ARS_OK,
+ ARS_PARTIAL, /* did not fully succeed, but no error */
+ ARS_ERROR,
+} PgAioResultStatus;
+
+typedef struct PgAioResult
+{
+ /*
+ * This is of type PgAioHandleSharedCallbackID, but can't use a bitfield
+ * of an enum, because some compilers treat enums as signed.
+ */
+ uint32 id:8;
+
+ /* of type PgAioResultStatus, see above */
+ uint32 status:2;
+
+ /* meaning defined by callback->error */
+ uint32 error_data:22;
+
+ int32 result;
+} PgAioResult;
+
+/*
+ * Result of IO operation, visible only to the initiator of IO.
+ */
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioSubjectData subject_data;
+} PgAioReturn;
+
+
+typedef struct PgAioSubjectInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+#ifdef NOT_YET
+ char *(*describe_identity) (PgAioHandle *ioh);
+#endif
+
+ const char *name;
+} PgAioSubjectInfo;
+
+
+typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
+typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
+
+typedef struct PgAioHandleSharedCallbacks
+{
+ PgAioHandleSharedCallbackPrepare prepare;
+ PgAioHandleSharedCallbackComplete complete;
+ PgAioHandleSharedCallbackError error;
+} PgAioHandleSharedCallbacks;
+
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define AIO_MAX_SHARED_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior);
+
+extern void pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid);
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern void pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid);
+
+extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
+
+extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+extern bool pgaio_io_has_subject(PgAioHandle *ioh);
+
+extern PgAioSubjectData *pgaio_io_get_subject_data(PgAioHandle *ioh);
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_ref_clear(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_valid(PgAioHandleRef *ior);
+extern int pgaio_io_ref_get_id(PgAioHandleRef *ior);
+
+
+extern void pgaio_io_ref_wait(PgAioHandleRef *ior);
+extern bool pgaio_io_ref_check_done(PgAioHandleRef *ior);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Low level IO preparation routines
+ *
+ * These will often be called by the lowest-level code involved in initiating
+ * an IO. E.g. bufmgr.c may initiate IO for a buffer, but pgaio_io_prep_readv()
+ * will be called from within fd.c.
+ *
+ * Implemented in aio_io.c
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
+
+
/* GUC related */
extern void assign_io_method(int newval, void *extra);
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..d600d45b4fd
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,244 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ *	  Declarations for AIO internals that should only be used by the AIO subsystem itself.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+#define PGAIO_VERBOSE
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ AHS_IDLE = 0,
+
+ /* returned by pgaio_io_get() */
+ AHS_HANDED_OUT,
+
+ /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ AHS_DEFINED,
+
+ /* subjects prepare() callback has been called */
+ AHS_PREPARED,
+
+ /* IO is being executed */
+ AHS_IN_FLIGHT,
+
+ /* IO finished, but result has not yet been processed */
+ AHS_REAPED,
+
+ /* IO completed, shared completion has been called */
+ AHS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ /* all state updates should go through pgaio_io_update_state() */
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioSubjectID subject:8;
+
+ /* which operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[AIO_MAX_SHARED_CALLBACKS];
+
+ uint8 iovec_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* FIXME: remove in favor of distilled_result */
+ /* raw result of the IO operation */
+ int32 result;
+
+ /* index into PgAioCtl->iovecs */
+ uint32 iovec_off;
+
+ /**
+ * The list in which the handle is registered depends on the state:
+ * - IDLE, in per-backend list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - PREPARED - in per-backend staged list
+ * - IN_FLIGHT - in issuer's in_flight list
+ * - REAPED - in issuer's in_flight list
+ * - COMPLETED_SHARED - in issuer's in_flight list
+ * - COMPLETED_LOCAL - in issuer's in_flight list
+ *
+ * XXX: It probably makes sense to optimize this out to save on per-IO
+ * memory at the cost of per-backend memory.
+ **/
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary for shared completions. Needs to be sufficient to allow
+ * another backend to retry an IO.
+ */
+ PgAioSubjectData scb_data;
+};
+
+
+typedef struct PgAioPerBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_get()/pgaio_io_get_nb() without
+ * having been either defined (by actually associating it with an IO) or
+ * released (with pgaio_io_release()). This restriction is necessary to
+ * guarantee that we can always acquire an IO. ->handed_out_io is used to
+ * enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /*
+ * List of in-flight IOs. Also contains IOs that aren't, strictly speaking,
+ * in flight anymore, but have been waited for and completed by another
+ * backend. Once this backend sees such an IO, it'll be reclaimed.
+ *
+ * The list is ordered by submission time, with more recently submitted
+ * IOs being appended at the end.
+ */
+ dclist_head in_flight_ios;
+} PgAioPerBackend;
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioPerBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *iovecs_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ */
+typedef struct IoMethodOps
+{
+ /* global initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ /* per-backend initialization */
+ void (*init_backend) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution) (PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+} IoMethodOps;
+
+
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+
+extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+
+extern const char *pgaio_io_get_subject_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern const IoMethodOps pgaio_sync_ops;
+
+extern const IoMethodOps *pgaio_impl;
+extern PgAioCtl *aio_ctl;
+extern PgAioPerBackend *my_aio;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_ref.h b/src/include/storage/aio_ref.h
new file mode 100644
index 00000000000..ad7e9ad34f3
--- /dev/null
+++ b/src/include/storage/aio_ref.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_ref.h
+ *	  Definition of PgAioHandleRef, which sometimes needs to be used in headers.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_ref.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_REF_H
+#define AIO_REF_H
+
+typedef struct PgAioHandleRef
+{
+ uint32 aio_index;
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioHandleRef;
+
+#endif /* AIO_REF_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index 4e534bc3e70..2d55720a54c 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,9 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3ebd7c40418..0356552c499 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -51,6 +51,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2475,6 +2476,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2988,6 +2991,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5351,6 +5358,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..b253278f3c1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,9 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_init.o \
+ aio_io.o \
+ aio_subject.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 72110c0df3e..3e2ff9718ca 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
* aio.c
* AIO - Core Logic
*
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ * subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ * look-ahead
+ *
+ *
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -14,7 +36,22 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+
+static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation);
+
/* Options for io_method. */
@@ -27,7 +64,876 @@ int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+/* global control for AIO */
+PgAioCtl *aio_ctl;
+
+/* current backend's per-backend state */
+PgAioPerBackend *my_aio;
+
+
+static const IoMethodOps *pgaio_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+
+const IoMethodOps *pgaio_impl;
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Core" IO Api
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Acquire an AioHandle, waiting for IO completion if necessary.
+ *
+ * Each backend can only have one AIO handle that has been "handed out"
+ * to code, but not yet submitted or released. This restriction is necessary
+ * to ensure that it is possible for code to wait for an unused handle by
+ * waiting for in-flight IO to complete. There is a limited number of handles
+ * in each backend; if multiple handles could be handed out without being
+ * submitted, waiting for all in-flight IO to complete would not guarantee
+ * that handles free up.
+ *
+ * It is cheap to acquire an IO handle, unless all handles are in use. In that
+ * case this function waits for the oldest IO to complete. In case that is not
+ * desirable, see pgaio_io_get_nb().
+ *
+ * If a handle was acquired but then does not turn out to be needed,
+ * e.g. because pgaio_io_get() is called before starting an IO in a critical
+ * section, the handle needs to be released with pgaio_io_release().
+ *
+ *
+ * To react to the IO's completion as soon as it is known to have
+ * completed, callbacks can be registered with pgaio_io_add_shared_cb().
+ *
+ * To actually execute IO using the returned handle, the pgaio_io_prep_*()
+ * family of functions is used. In many cases the pgaio_io_prep_*() call will
+ * not be done directly by code that acquired the handle, but by lower level
+ * code that gets passed the handle. E.g. if code in bufmgr.c wants to perform
+ * AIO, it typically will pass the handle to smgr., which will pass it on to
+ * md.c, on to fd.c, which then finally calls pgaio_io_prep_*(). This
+ * forwarding allows the various layers to react to the IO's completion by
+ * registering callbacks. These callbacks in turn can translate a lower
+ * layer's result into a result understandable by a higher layer.
+ *
+ * Once pgaio_io_prep_*() is called, the IO may be in the process of being
+ * executed and might even complete before the functions return. That is,
+ * however, not guaranteed, to allow IO submission to be batched. To guarantee
+ * IO submission pgaio_submit_staged() needs to be called.
+ *
+ * After pgaio_io_prep_*() the AioHandle is "consumed" and may not be
+ * referenced by the IO issuing code. To e.g. wait for IO, references to the
+ * IO can be established with pgaio_io_get_ref() *before* pgaio_io_prep_*() is
+ * called. pgaio_io_ref_wait() can be used to wait for the IO to complete.
+ *
+ *
+ * To know if the IO [partially] succeeded or failed, a PgAioReturn * can be
+ * passed to pgaio_io_get(). Once the issuing backend has called
+ * pgaio_io_ref_wait(), the PgAioReturn contains information about whether the
+ * operation succeeded and details about the first failure, if any. The error
+ * can be raised / logged with pgaio_result_log().
+ *
+ * The lifetime of the memory pointed to by *ret needs to be at least as long
+ * as that of the passed-in resowner. If the resowner releases resources before the IO
+ * completes, the reference to *ret will be cleared.
+ */
+PgAioHandle *
+pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_get_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+/*
+ * Acquire an AioHandle, returning NULL if no handles are free.
+ *
+ * See pgaio_io_get(). The only difference is that this function will return
+ * NULL if there are no idle handles, instead of blocking.
+ */
+PgAioHandle *
+pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (my_aio->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(my_aio->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (my_aio->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: Only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&my_aio->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&my_aio->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == AHS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_io_update_state(ioh, AHS_HANDED_OUT);
+ my_aio->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ {
+ ioh->report_return = ret;
+ ret->result.status = ARS_UNKNOWN;
+ }
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+/*
+ * Release IO handle that turned out to not be required.
+ *
+ * See pgaio_io_get() for more details.
+ */
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == my_aio->handed_out_io)
+ {
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ my_aio->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+/*
+ * Release IO handle during resource owner cleanup.
+ */
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ elog(ERROR, "unexpected idle IO handle during resowner cleanup");
+ break;
+ case AHS_HANDED_OUT:
+ Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+ if (ioh == my_aio->handed_out_io)
+ {
+ my_aio->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case AHS_DEFINED:
+ case AHS_PREPARED:
+ /* XXX: Should we warn about this when on_error is false? */
+ pgaio_submit_staged();
+ break;
+ case AHS_IN_FLIGHT:
+ case AHS_REAPED:
+ case AHS_COMPLETED_SHARED:
+ /* this is expected to happen */
+ break;
+ case AHS_COMPLETED_LOCAL:
+ /* XXX: unclear if this ought to be possible? */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result, the memory it's
+ * referencing likely has gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+/*
+ * Return the iovec allocated for the IO, and the maximum number of entries
+ * the caller may fill in.
+ */
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* AFIXME: Needs to be the value at startup time */
+ return io_combine_limit;
+}
+
+PgAioSubjectData *
+pgaio_io_get_subject_data(PgAioHandle *ioh)
+{
+ return &ioh->scb_data;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+bool
+pgaio_io_has_subject(PgAioHandle *ioh)
+{
+ return ioh->subject != ASI_INVALID;
+}
+
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+/*
+ * Associate an array of data with the IO, stored in the handle's iovec data
+ * slots. The values are widened to uint64 and can later be retrieved with
+ * pgaio_io_get_io_data().
+ */
+void
+pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ for (int i = 0; i < len; i++)
+ aio_ctl->iovecs_data[ioh->iovec_off + i] = data[i];
+ ioh->iovec_data_len = len;
+}
+
+uint64 *
+pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->iovec_data_len > 0);
+
+ *len = ioh->iovec_data_len;
+
+ return &aio_ctl->iovecs_data[ioh->iovec_off];
+}
+
+void
+pgaio_io_set_subject(PgAioHandle *ioh, PgAioSubjectID subjid)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+
+ ioh->subject = subjid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, set subject",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh));
+}
+
+void
+pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
+{
+ Assert(ioh->state == AHS_HANDED_OUT ||
+ ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARED);
+ Assert(ioh->generation != 0);
+
+ ior->aio_index = ioh - aio_ctl->io_handles;
+ ior->generation_upper = (uint32) (ioh->generation >> 32);
+ ior->generation_lower = (uint32) ioh->generation;
+}
+
+void
+pgaio_io_ref_clear(PgAioHandleRef *ior)
+{
+ ior->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_io_ref_valid(PgAioHandleRef *ior)
+{
+ return ior->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_io_ref_get_id(PgAioHandleRef *ior)
+{
+ Assert(pgaio_io_ref_valid(ior));
+ return ior->aio_index;
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == AHS_DEFINED || state == AHS_PREPARED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != AHS_IN_FLIGHT
+ && state != AHS_REAPED
+ && state != AHS_COMPLETED_SHARED
+ && state != AHS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+
+ /*
+ * Somebody else completed the IO, need to execute issuer callback, so
+ * reclaim eagerly.
+ */
+ if (state == AHS_COMPLETED_LOCAL)
+ {
+ pgaio_io_reclaim(ioh);
+
+ return;
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case AHS_IDLE:
+ case AHS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case AHS_IN_FLIGHT:
+ /*
+ * If we need to wait via the IO method, do so now. Don't
+ * check via the IO method if the issuing backend is executing
+ * the IO synchronously.
+ */
+ if (pgaio_impl->wait_one && !(ioh->flags & AHF_SYNCHRONOUS))
+ {
+ pgaio_impl->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case AHS_PREPARED:
+ case AHS_DEFINED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case AHS_REAPED:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state != AHS_DEFINED && state != AHS_PREPARED &&
+ state != AHS_IN_FLIGHT && state != AHS_REAPED)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case AHS_COMPLETED_SHARED:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ case AHS_COMPLETED_LOCAL:
+ return;
+ }
+ }
+}
+
+/*
+ * Check if the referenced IO completed, without blocking.
+ */
+bool
+pgaio_io_ref_check_done(PgAioHandleRef *ior)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == AHS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == AHS_COMPLETED_SHARED || state == AHS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= aio_ctl->io_handles &&
+ ioh < (aio_ctl->io_handles + aio_ctl->io_handle_count));
+ return ioh - aio_ctl->io_handles;
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ switch (ioh->state)
+ {
+ case AHS_IDLE:
+ return "idle";
+ case AHS_HANDED_OUT:
+ return "handed_out";
+ case AHS_DEFINED:
+ return "defined";
+ case AHS_PREPARED:
+ return "prepared";
+ case AHS_IN_FLIGHT:
+ return "in_flight";
+ case AHS_REAPED:
+ return "reaped";
+ case AHS_COMPLETED_SHARED:
+ return "completed_shared";
+ case AHS_COMPLETED_LOCAL:
+ return "completed_local";
+ }
+ pg_unreachable();
+}
+
+/*
+ * Internal, should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+ bool needs_synchronous;
+
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+
+ ioh->op = op;
+ ioh->result = 0;
+
+ pgaio_io_update_state(ioh, AHS_DEFINED);
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_prepare_subject(ioh);
+
+ pgaio_io_update_state(ioh, AHS_PREPARED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ elog(DEBUG3, "io:%d: prepared %s, executed synchronously: %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+/*
+ * Handle IO getting completed by a method.
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == AHS_IN_FLIGHT);
+
+ ioh->result = result;
+
+ pgaio_io_update_state(ioh, AHS_REAPED);
+
+ pgaio_io_process_completion_subject(ioh);
+
+ pgaio_io_update_state(ioh, AHS_COMPLETED_SHARED);
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (ioh->flags & AHF_SYNCHRONOUS)
+ {
+ /* XXX: should we also check if there are other IOs staged? */
+ return true;
+ }
+
+ if (pgaio_impl->needs_synchronous_execution)
+ return pgaio_impl->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Handle IO being processed by IO method.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ pgaio_io_update_state(ioh, AHS_IN_FLIGHT);
+
+ dclist_push_tail(&my_aio->in_flight_ios, &ioh->node);
+}
+
+static inline void
+pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
+{
+ /*
+ * Ensure the changes signified by the new state are visible before the
+ * new state becomes visible. Pairs with the read barrier in
+ * pgaio_io_was_recycled().
+ */
+ pg_write_barrier();
+
+ ioh->state = new_state;
+}
+
+static PgAioHandle *
+pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(ior->aio_index < aio_ctl->io_handle_count);
+
+ ioh = &aio_ctl->io_handles[ior->aio_index];
+
+ *ref_generation = ((uint64) ior->generation_upper) << 32 |
+ ior->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ ereport(DEBUG3,
+ errmsg("reclaiming io:%d, state: %s, op %s, subject %s, result: %d, distilled_result: AFIXME, report to: %p",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_state_name(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->result,
+ ioh->report_return
+ ),
+ errhidestmt(true), errhidecontext(true));
+
+ /* if the IO has been defined, we might need to do more work */
+ if (ioh->state != AHS_HANDED_OUT)
+ {
+ dclist_delete_from(&my_aio->in_flight_ios, &ioh->node);
+
+ if (ioh->report_return)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->subject_data = ioh->scb_data;
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->num_shared_callbacks = 0;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->flags = 0;
+
+ /* XXX: the barrier is probably superfluous */
+ pg_write_barrier();
+ ioh->generation++;
+
+ pgaio_io_update_state(ioh, AHS_IDLE);
+
+ /*
+ * We push the IO to the head of the idle IO list, which seems more
+ * cache-efficient in cases where only a few IOs are used.
+ */
+ dclist_push_head(&my_aio->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ int reclaimed = 0;
+
+ elog(DEBUG2,
+ "waiting for self: %d pending",
+ my_aio->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - with
+ * io_method=worker that'll often be the case. We could do this as part
+ * of the loop below, but then we might end up blocking on one specific
+ * IO even though others have already completed.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[my_aio->io_handle_off + i];
+
+ if (ioh->state == AHS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ /*
+ * If we have any unsubmitted IOs, submit them now. We are about to
+ * wait, so it's better they're in flight. This also addresses the
+ * edge case of all IOs being unsubmitted.
+ */
+ if (my_aio->num_staged_ios > 0)
+ {
+ elog(DEBUG2, "submitting while acquiring free io");
+ pgaio_submit_staged();
+ }
+
+ /*
+ * It's possible that we recognized there were free IOs while
+ * submitting, e.g. because IOs completed and were reclaimed during
+ * submission.
+ */
+ if (dclist_count(&my_aio->idle_ios) > 0)
+ return;
+
+ /* if nothing is in flight, waiting cannot succeed */
+ if (dclist_count(&my_aio->in_flight_ios) == 0)
+ {
+ elog(ERROR, "no free IOs despite no in-flight IOs");
+ }
+
+ /*
+ * Wait for the oldest in-flight IO to complete.
+ *
+ * XXX: Reusing the general IO wait is suboptimal, we don't need to wait
+ * for that specific IO to complete, we just need *any* IO to complete.
+ */
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &my_aio->in_flight_ios);
+
+ switch (ioh->state)
+ {
+ /* should not be in in-flight list */
+ case AHS_IDLE:
+ case AHS_DEFINED:
+ case AHS_HANDED_OUT:
+ case AHS_PREPARED:
+ case AHS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+
+ case AHS_REAPED:
+ case AHS_IN_FLIGHT:
+ {
+ PgAioHandleRef ior;
+
+ ior.aio_index = ioh - aio_ctl->io_handles;
+ ior.generation_upper = (uint32) (ioh->generation >> 32);
+ ior.generation_lower = (uint32) ioh->generation;
+
+ pgaio_io_ref_wait(&ior);
+ elog(DEBUG2, "waited for io:%d",
+ pgaio_io_get_id(ioh));
+ }
+ break;
+ case AHS_COMPLETED_SHARED:
+ /* it's possible that another backend just finished this IO */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ if (dclist_count(&my_aio->idle_ios) == 0)
+ elog(PANIC, "no idle IOs after waiting");
+ return;
+ }
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (my_aio->num_staged_ios == 0)
+ return;
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_impl->submit(my_aio->num_staged_ios, my_aio->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ my_aio->num_staged_ios = 0;
+
+#ifdef PGAIO_VERBOSE
+ ereport(DEBUG2,
+ errmsg("submitted %d IOs",
+ total_submitted),
+ errhidestmt(true),
+ errhidecontext(true));
+#endif
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return my_aio->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Before closing an fd, submit any staged-but-not-yet-submitted IOs that
+ * reference it - otherwise the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!my_aio)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!my_aio->handed_out_io);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!my_aio->handed_out_io);
+}
+
+
void
assign_io_method(int newval, void *extra)
{
+ pgaio_impl = pgaio_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 84e0e37baae..b9bdf51680a 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,28 +14,206 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* aio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioPerBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioIOVShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioIOVDataShmemSize(void)
+{
+ /* FIXME: io_combine_limit is USERSET */
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(io_combine_limit, AioProcs()),
+ io_max_concurrency));
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
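+ *
+ * For example (illustrative numbers): with NBuffers = 16384 and ~110
+ * backends/aux processes, 16384 / 110 = 148 exceeds the cap and is limited
+ * to 64; with a very small shared_buffers the result can drop to 1.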
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the config
+ * file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and we must
+ * force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioIOVShmemSize());
+ sz = add_size(sz, AioIOVDataShmemSize());
+
+ if (pgaio_impl->shmem_size)
+ sz = add_size(sz, pgaio_impl->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * io_combine_limit;
+
+ aio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(aio_ctl, 0, AioCtlShmemSize());
+
+ aio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ aio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+
+ aio_ctl->backend_state = (PgAioPerBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ aio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ aio_ctl->iovecs = ShmemInitStruct("AioIOV", AioIOVShmemSize(), &found);
+ aio_ctl->iovecs_data = ShmemInitStruct("AioIOVData", AioIOVDataShmemSize(), &found);
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioPerBackend *bs = &aio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ dclist_init(&bs->in_flight_ios);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &aio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->iovec_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += io_combine_limit;
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_impl->shmem_init)
+ pgaio_impl->shmem_init(!found);
}
void
pgaio_init_backend(void)
{
-}
+ /* shouldn't be initialized twice */
+ Assert(!my_aio);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ my_aio = &aio_ctl->backend_state[MyProcNumber];
-void
-pgaio_postmaster_child_init_local(void)
-{
+ if (pgaio_impl->init_backend)
+ pgaio_impl->init_backend();
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..3c255775833
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,140 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * AIO - Low Level IO Handling
+ *
+ * Functions related to associating IO operations with IO handles, as well as
+ * IO-method independent support functions for actually performing IO.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void pgaio_io_before_prep(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Preparation" routines for individual IO types
+ *
+ * These are called by place the place actually initiating an IO, to associate
+ * the IO specific data with an AIO handle.
+ *
+ * Each of the preparation routines first needs to call
+ * pgaio_io_before_prep(), then fill IO specific fields in the handle and then
+ * finally call pgaio_io_prepare().
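+ *
+ * A hypothetical caller-side sketch (buf, fd and offset are illustrative,
+ * not identifiers from this patch):
+ *
+ *   struct iovec *iov;
+ *
+ *   (void) pgaio_io_get_iovec(ioh, &iov);
+ *   iov[0].iov_base = buf;
+ *   iov[0].iov_len = BLCKSZ;
+ *   pgaio_io_prep_readv(ioh, fd, 1, offset);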
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_READV);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_prepare(ioh, PGAIO_OP_WRITEV);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Functions implementing IO handle operations that are directly related to IO
+ * operations.
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Execute IO operation synchronously. This is implemented here, not in
+ * method_sync.c, because other IO methods might also use it / fall back to
+ * it.
+ */
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &aio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITEV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to execute invalid IO operation");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
+
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READV:
+ return "read";
+ case PGAIO_OP_WRITEV:
+ return "write";
+ }
+
+ pg_unreachable();
+}
+
+/*
+ * Helper function to be called by IO operation preparation functions, before
+ * any data in the handle is set. Mostly to centralize assertions.
+ */
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == AHS_HANDED_OUT);
+ Assert(pgaio_io_has_subject(ioh));
+}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
new file mode 100644
index 00000000000..8694cfafcd1
--- /dev/null
+++ b/src/backend/storage/aio/aio_subject.c
@@ -0,0 +1,231 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_subject.c
+ * AIO - Functionality related to executing IO for different subjects
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_subject.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Registry for entities that can be the target of AIO.
+ *
+ * To support execution by worker processes, the file descriptor for an IO
+ * may need to be reopened in a different process. This is done via the
+ * PgAioSubjectInfo.reopen callback.
+ */
+static const PgAioSubjectInfo *aio_subject_info[] = {
+ [ASI_INVALID] = &(PgAioSubjectInfo) {
+ .name = "invalid",
+ },
+};
+
+
+typedef struct PgAioHandleSharedCallbacksEntry
+{
+ const PgAioHandleSharedCallbacks *const cb;
+ const char *const name;
+} PgAioHandleSharedCallbacksEntry;
+
+static const PgAioHandleSharedCallbacksEntry aio_shared_cbs[] = {
+#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+#undef CALLBACK_ENTRY
+};
+
+
+/*
+ * Register callback for the IO handle.
+ *
+ * Only a limited number (AIO_MAX_SHARED_CALLBACKS) of callbacks can be
+ * registered for each IO.
+ *
+ * Callbacks need to be registered before [indirectly] calling
+ * pgaio_io_prep_*(), as the IO may be executed immediately.
+ *
+ *
+ * Note that callbacks are executed in critical sections. This is necessary
+ * to be able to execute IO in critical sections (consider e.g. WAL
+ * logging). To perform AIO we first need to acquire a handle, which, if
+ * there are no free handles, requires waiting for IOs to complete and
+ * executing their completion callbacks.
+ *
+ * Callbacks may be executed in the issuing backend but also in another
+ * backend (because that backend is waiting for the IO) or in IO workers (if
+ * io_method=worker is used).
+ *
+ *
+ * See PgAioHandleSharedCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
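+ *
+ * A hypothetical registration sketch (ASI_SMGR and ASC_MD_READV are
+ * illustrative names, this patch so far only defines ASI_INVALID):
+ *
+ *   pgaio_io_set_subject(ioh, ASI_SMGR);
+ *   pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
+ *   pgaio_io_prep_readv(ioh, fd, iovcnt, offset);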
+ */
+void
+pgaio_io_add_shared_cb(PgAioHandle *ioh, PgAioHandleSharedCallbackID cbid)
+{
+ const PgAioHandleSharedCallbacksEntry *ce;
+
+ if (cbid >= lengthof(aio_shared_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ ce = &aio_shared_cbs[cbid];
+ if (ce->cb->complete == NULL)
+ elog(ERROR, "callback %d is undefined", cbid);
+ if (ioh->num_shared_callbacks >= AIO_MAX_SHARED_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", AIO_MAX_SHARED_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, adding cb #%d, id %d/%s",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ ioh->num_shared_callbacks + 1,
+ cbid, ce->name);
+
+ ioh->num_shared_callbacks++;
+}
+
+/*
+ * Return the name for the subject associated with the IO. Mostly useful for
+ * debugging/logging.
+ */
+const char *
+pgaio_io_get_subject_name(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+
+ return aio_subject_info[ioh->subject]->name;
+}
+
+/*
+ * Internal function which invokes ->prepare for all the registered callbacks.
+ */
+void
+pgaio_io_prepare_subject(PgAioHandle *ioh)
+{
+ Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ if (!ce->cb->prepare)
+ continue;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d %d/%s->prepare",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i,
+ cbid, ce->name);
+ ce->cb->prepare(ioh);
+ }
+}
+
+/*
+ * Internal function which invokes ->complete for all the registered
+ * callbacks.
+ */
+void
+pgaio_io_process_completion_subject(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = ASC_INVALID;
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d, id %d/%s->complete with distilled result status %d, id %u, error_data: %d, result: %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ i,
+ cbid, ce->name,
+ result.status,
+ result.id,
+ result.error_data,
+ result.result);
+ result = ce->cb->complete(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ elog(DEBUG3, "io:%d, op %s, subject %s, distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ pgaio_io_get_id(ioh),
+ pgaio_io_get_op_name(ioh),
+ pgaio_io_get_subject_name(ioh),
+ result.status,
+ result.id,
+ result.error_data,
+ result.result,
+ ioh->result);
+}
+
+/*
+ * Check if pgaio_io_reopen() is available for the IO.
+ */
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return aio_subject_info[ioh->subject]->reopen != NULL;
+}
+
+/*
+ * Before executing an IO outside of the context of the process the IO has
+ * been prepared in, the file descriptor has to be reopened - any FD
+ * referenced in the IO itself, won't be valid in the separate process.
+ */
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->subject >= 0 && ioh->subject < ASI_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ aio_subject_info[ioh->subject]->reopen(ioh);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_log(PgAioResult result, const PgAioSubjectData *subject_data, int elevel)
+{
+ PgAioHandleSharedCallbackID cbid = result.id;
+ const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ if (ce->cb->error == NULL)
+ elog(ERROR, "scb id %d/%s does not have an error callback",
+ result.id, ce->name);
+
+ ce->cb->error(result, subject_data, elevel);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8d20759ebf8..8339d473aae 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,5 +3,8 @@
backend_sources += files(
'aio.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_subject.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..61fd06a277b
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,45 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * AIO - perform "AIO" by executing it synchronously
+ *
+ * This method mainly exists to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+
+ return 0;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..7a2e2b4432e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -190,6 +190,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 505534ee8d3..5cf14472ebd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,12 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles need to be registered in critical sections and therefore
+ * cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
};
@@ -425,6 +433,8 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+
return owner;
}
@@ -725,6 +735,14 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1100,15 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2586d1cf53f..bc1acbb98ee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1263,6 +1263,7 @@ InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2100,6 +2101,23 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioCtl
+PgAioHandle
+PgAioHandleFlags
+PgAioHandleRef
+PgAioHandleSharedCallbackID
+PgAioHandleSharedCallbacks
+PgAioHandleSharedCallbacksEntry
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioPerBackend
+PgAioResultStatus
+PgAioResult
+PgAioReturn
+PgAioSubjectData
+PgAioSubjectID
+PgAioSubjectInfo
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.746.g06e570c0df.dirty
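As an aside for reviewers: the completion path in the patch above (pgaio_io_process_completion_subject() iterating the shared callbacks in reverse registration order, each one refining the "distilled" result handed to the next) follows a common chaining pattern. Below is a minimal standalone sketch of just that pattern; every name in it is invented for illustration and is not the patch's actual API.

```c
#include <assert.h>

/* Simplified stand-ins for the patch's result types (illustrative only). */
typedef enum { ARS_OK, ARS_ERROR } DemoStatus;

typedef struct DemoResult
{
	DemoStatus	status;
	int			result;			/* raw return value of the IO */
	int			error_data;		/* callback-specific detail, e.g. an errno */
} DemoResult;

typedef DemoResult (*demo_complete_cb) (DemoResult prior);

/* A callback that flags negative raw results as errors. */
static DemoResult
demo_check_errno(DemoResult prior)
{
	if (prior.result < 0)
	{
		prior.status = ARS_ERROR;
		prior.error_data = -prior.result;	/* pretend this is a saved errno */
	}
	return prior;
}

/* A callback that would update subject state, leaving the status alone. */
static DemoResult
demo_note_done(DemoResult prior)
{
	return prior;
}

/*
 * Invoke callbacks in reverse registration order, as the patch does: the
 * callback registered last runs first, and each one receives the result
 * "distilled" by the callbacks that ran before it.
 */
static DemoResult
demo_process_completion(demo_complete_cb *cbs, int ncbs, int raw_result)
{
	DemoResult	r = {ARS_OK, raw_result, 0};

	for (int i = ncbs; i > 0; i--)
		r = cbs[i - 1](r);
	return r;
}
```

The reverse iteration mirrors the patch: since preparation callbacks run first-to-last, completion runs last-to-first, so the callback closest to the raw IO gets the first look at the result.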
Attachment: v2-0005-aio-Skeleton-IO-worker-infrastructure.patch (text/x-diff; charset=us-ascii)
From e6c7783183c0b36f94b9debfd9edde71e4d75bbc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 25 Nov 2024 14:03:40 -0500
Subject: [PATCH v2 05/20] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/pmchild.c | 1 +
src/backend/postmaster/postmaster.c | 171 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_init.c | 7 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 86 +++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_backend.c | 1 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
19 files changed, 310 insertions(+), 12 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e4c0d1481e9..0afc57ebf27 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -360,6 +360,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -389,6 +390,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
#define AmSpecialWorkerProcess() \
(AmAutoVacuumLauncherProcess() || \
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 24d49a5439e..4d003b7f86d 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -98,6 +98,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 1c1d62baa79..70976791c93 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -21,4 +21,6 @@ extern void AioShmemInit(void);
extern void pgaio_init_backend(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..ba5dcb9e6e4
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0b1fa61310f..cafd0b334b9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -461,7 +461,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 1f2d829ec5a..7399adfeae9 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -48,6 +48,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "storage/dsm.h"
+#include "storage/io_worker.h"
#include "storage/pg_shmem.h"
#include "tcop/backend_startup.h"
#include "utils/memutils.h"
@@ -197,6 +198,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/pmchild.c b/src/backend/postmaster/pmchild.c
index 381cf005a9b..89ee626829d 100644
--- a/src/backend/postmaster/pmchild.c
+++ b/src/backend/postmaster/pmchild.c
@@ -101,6 +101,7 @@ InitPostmasterChildSlots(void)
pmchild_pools[B_AUTOVAC_WORKER].size = autovacuum_max_workers;
pmchild_pools[B_BG_WORKER].size = max_worker_processes;
+ pmchild_pools[B_IO_WORKER].size = MAX_IO_WORKERS;
/*
* There can be only one of each of these running at a time. They each
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f849ffbcb5..8dab7072114 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -108,9 +108,12 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "tcop/backend_startup.h"
#include "tcop/tcopprot.h"
#include "utils/datetime.h"
@@ -172,6 +175,7 @@ btmask_all_except(BackendType t)
return mask;
}
+#ifdef NOT_USED
static inline BackendTypeMask
btmask_all_except2(BackendType t1, BackendType t2)
{
@@ -181,6 +185,18 @@ btmask_all_except2(BackendType t1, BackendType t2)
mask = btmask_del(mask, t2);
return mask;
}
+#endif
+
+static inline BackendTypeMask
+btmask_all_except3(BackendType t1, BackendType t2, BackendType t3)
+{
+ BackendTypeMask mask = BTYPE_MASK_ALL;
+
+ mask = btmask_del(mask, t1);
+ mask = btmask_del(mask, t2);
+ mask = btmask_del(mask, t3);
+ return mask;
+}
static inline bool
btmask_contains(BackendTypeMask mask, BackendType t)
@@ -329,6 +345,7 @@ typedef enum
* ckpt */
PM_SHUTDOWN_2, /* waiting for archiver and walsenders to
* finish */
+ PM_SHUTDOWN_IO, /* waiting for io workers to exit */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -390,6 +407,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static PMChild *io_worker_children[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -424,6 +445,8 @@ static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static PMChild *StartChildProcess(BackendType type);
static void StartSysLogger(void);
@@ -1351,6 +1374,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
@@ -1363,7 +1391,6 @@ PostmasterMain(int argc, char *argv[])
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2503,6 +2530,16 @@ process_pm_child_exit(void)
continue;
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+ continue;
+ }
+
/*
* Was it a backend or a background worker?
*/
@@ -2867,10 +2904,10 @@ PostmasterStateMachine(void)
targetMask = btmask_add(targetMask, B_CHECKPOINTER);
/*
- * Walsenders and archiver will continue running; they will be
- * terminated later after writing the checkpoint record. We also let
- * dead-end children to keep running for now. The syslogger process
- * exits last.
+ * Walsenders, archiver and IO workers will continue running; they
+ * will be terminated later after writing the checkpoint record. We
+ * also let dead-end children keep running for now. The syslogger
+ * process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2882,6 +2919,7 @@ PostmasterStateMachine(void)
remainMask = btmask_add(remainMask, B_WAL_SENDER);
remainMask = btmask_add(remainMask, B_ARCHIVER);
+ remainMask = btmask_add(remainMask, B_IO_WORKER);
remainMask = btmask_add(remainMask, B_DEAD_END_BACKEND);
remainMask = btmask_add(remainMask, B_LOGGER);
@@ -2963,7 +3001,7 @@ PostmasterStateMachine(void)
pmState = PM_WAIT_DEAD_END;
ConfigurePostmasterWaitSet(false);
- /* Kill the walsenders and archiver too */
+ /* Kill walsenders, archiver and IO workers too */
SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
}
}
@@ -2974,11 +3012,23 @@ PostmasterStateMachine(void)
{
/*
* PM_SHUTDOWN_2 state ends when there's no other children than
- * dead-end children left. There shouldn't be any regular backends
- * left by now anyway; what we're really waiting for is walsenders and
- * archiver.
+ * dead-end children and io workers left. There shouldn't be any
+ * regular backends left by now anyway; what we're really waiting for
+ * is walsenders and archiver.
*/
- if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except3(B_LOGGER, B_DEAD_END_BACKEND, B_IO_WORKER)) == 0)
+ {
+ pmState = PM_SHUTDOWN_IO;
+ SignalChildren(SIGUSR2, btmask(B_IO_WORKER));
+ }
+ }
+
+ if (pmState == PM_SHUTDOWN_IO)
+ {
+ /*
+ * PM_SHUTDOWN_IO state ends when there are only dead-end children left.
+ */
+ if (io_worker_count == 0)
{
pmState = PM_WAIT_DEAD_END;
ConfigurePostmasterWaitSet(false);
@@ -3094,10 +3144,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ pmState = PM_STARTUP;
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- pmState = PM_STARTUP;
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3918,6 +3972,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
+ case PM_SHUTDOWN_IO:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
@@ -4070,6 +4125,100 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] &&
+ io_worker_children[id]->pid == pid)
+ {
+ ReleasePostmasterChildSlot(io_worker_children[id]);
+
+ --io_worker_count;
+ io_worker_children[id] = NULL;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ if (!pgaio_workers_enabled())
+ return;
+
+ /*
+ * If we're in the final shutdown state, we're just waiting for all
+ * remaining processes to exit.
+ */
+ if (pmState >= PM_SHUTDOWN_IO)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ Assert(pmState < PM_SHUTDOWN_IO);
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ PMChild *child;
+ int id;
+
+ /* find unused entry in io_worker_children array */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] == NULL)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ /* Try to launch one. */
+ child = StartChildProcess(B_IO_WORKER);
+ if (child != NULL)
+ {
+ io_worker_children[id] = child;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* ask the IO worker in the highest slot to exit */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_children[id] != NULL)
+ {
+ kill(io_worker_children[id]->pid, SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index b253278f3c1..fa2a7e9e5df 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_io.o \
aio_subject.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index b9bdf51680a..0c2d77ec8ab 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -217,3 +217,10 @@ pgaio_init_backend(void)
if (pgaio_impl->init_backend)
pgaio_impl->init_backend();
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ /* placeholder for future commit */
+ return false;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8339d473aae..62738ce1d14 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,5 +6,6 @@ backend_sources += files(
'aio_io.c',
'aio_subject.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..0ea749a8ba8
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/auxprocess.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 85902788181..fcd3e1eb482 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3313,6 +3313,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 6b2c9baa8c0..c48befef6a7 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -166,6 +166,7 @@ pgstat_tracks_backend_bktype(BackendType bktype)
case B_WAL_SUMMARIZER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_STARTUP:
return false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 011a3326dad..7869197dd1f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -365,6 +365,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_INVALID:
case B_DEAD_END_BACKEND:
case B_ARCHIVER:
+ case B_IO_WORKER:
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7a2e2b4432e..330a32a90ce 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 6349abb8fb6..56133cfdd08 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = gettext_noop("checkpointer");
break;
+ case B_IO_WORKER:
+ backendDesc = gettext_noop("io worker");
+ break;
case B_LOGGER:
backendDesc = gettext_noop("logger");
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6d4056c68b9..b2999b86c24 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3232,6 +3233,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4c60da9845..0f80a0680ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -843,6 +843,7 @@
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.45.2.746.g06e570c0df.dirty
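The pool-adjustment logic in maybe_adjust_io_workers() above reduces to a simple reconciliation invariant: launch workers into free slots while below the target, and ask the worker in the highest occupied slot to exit while above it (at most one retirement per call). Here is a standalone sketch of that invariant with stand-in types; the real code launches child processes and signals SIGUSR2, and all names below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_WORKERS 32

typedef struct DemoPool
{
	bool		slots[DEMO_MAX_WORKERS];	/* true if a worker occupies it */
	int			count;
} DemoPool;

/* Stand-in for StartChildProcess(): claim the lowest free slot. */
static int
demo_launch(DemoPool *pool)
{
	for (int id = 0; id < DEMO_MAX_WORKERS; id++)
	{
		if (!pool->slots[id])
		{
			pool->slots[id] = true;
			pool->count++;
			return id;
		}
	}
	return -1;					/* no free slot */
}

/* Stand-in for sending SIGUSR2: retire the highest occupied slot. */
static void
demo_retire_one(DemoPool *pool)
{
	for (int id = DEMO_MAX_WORKERS - 1; id >= 0; id--)
	{
		if (pool->slots[id])
		{
			pool->slots[id] = false;
			pool->count--;
			return;
		}
	}
}

/* Reconcile the pool with the target, like maybe_adjust_io_workers(). */
static void
demo_adjust(DemoPool *pool, int target)
{
	while (pool->count < target)
	{
		if (demo_launch(pool) < 0)
			break;				/* pool exhausted; the real code errors out */
	}
	/* The real code asks only one worker to exit per call; mirror that. */
	if (pool->count > target)
		demo_retire_one(pool);
}
```

Retiring from the highest slot keeps the occupied slots dense at the low end, which is why the launch path can simply scan upward for the first free entry.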
Attachment: v2-0006-aio-Add-worker-method.patch (text/x-diff; charset=us-ascii)
From 9c9bbb42fb561fb2cf7d6d5183db5359d37e004e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 8 Nov 2024 12:38:41 -0500
Subject: [PATCH v2 06/20] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 12 +-
src/backend/storage/aio/method_worker.c | 406 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
9 files changed, 423 insertions(+), 10 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index b386dabc921..2e84abfea2d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -322,11 +322,12 @@ extern void assign_io_method(int newval, void *extra);
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to the worker method. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/* GUCs */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index d600d45b4fd..f974c4accf5 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -234,6 +234,7 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
+extern const IoMethodOps pgaio_worker_ops;
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..8d00d62e208 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, AioWorkerSubmissionQueue)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 3e2ff9718ca..e4c9d439ddd 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -57,6 +57,7 @@ static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generatio
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -73,6 +74,7 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 0c2d77ec8ab..23adc5308e5 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -19,6 +19,7 @@
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
#include "storage/bufmgr.h"
+#include "storage/io_worker.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -37,6 +38,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -209,6 +215,9 @@ pgaio_init_backend(void)
/* shouldn't be initialized twice */
Assert(!my_aio);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -221,6 +230,5 @@ pgaio_init_backend(void)
bool
pgaio_workers_enabled(void)
{
- /* placeholder for future commit */
- return false;
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 0ea749a8ba8..a508f53ebd4 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -1,7 +1,22 @@
/*-------------------------------------------------------------------------
*
* method_worker.c
- * AIO implementation using workers
+ * AIO - perform AIO using worker processes
+ *
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken backend can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
*
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -16,23 +31,323 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
+#include "utils/ps_status.h"
#include "utils/wait_event.h"
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+};
+
+
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ elog(DEBUG1, "io worker submission queue is full");
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ ereport(DEBUG3,
+ errmsg("submission for io:%d choosing worker %d, latch %p",
+ pgaio_io_get_id(ios[i]), worker, wakeup),
+ errhidestmt(true), errhidecontext(true));
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & AHF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
+
+/*
+ * shmem_exit() callback that releases the worker's slot in io_worker_control.
+ */
+static void
+pgaio_worker_die(int code, Datum arg)
+{
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+}
+
+/*
+ * Register the worker in shared memory, assign MyIoWorkerId and register a
+ * shutdown callback to release the registration.
+ */
+static void
+pgaio_worker_register(void)
+{
+ MyIoWorkerId = -1;
+
+ /*
+ * XXX: This could do with more fine-grained locking. But it's also not
+ * very common for the number of workers to change at the moment...
+ */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ on_shmem_exit(pgaio_worker_die, 0);
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
AuxiliaryProcessMainCommon();
@@ -53,6 +368,11 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ pgaio_worker_register();
+
+ sprintf(cmd, "io worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
+
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
@@ -66,8 +386,26 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
LWLockReleaseAll();
/* TODO: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
+ /* FIXME: should probably be a before-shmem-exit instead */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
proc_exit(1);
}
@@ -76,10 +414,68 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+#if 0
+ if (nwakeups > 0)
+ elog(LOG, "wake %d", nwakeups);
+#endif
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &aio_ctl->io_handles[io_index];
+
+ ereport(DEBUG3,
+ errmsg("worker processing io:%d",
+ pgaio_io_get_id(unvolatize(PgAioHandle *, ioh))),
+ errhidestmt(true), errhidecontext(true));
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
proc_exit(0);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 330a32a90ce..8c3aafd8a18 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -349,6 +349,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0f80a0680ec..5893eb29228 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -842,7 +842,7 @@
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bc1acbb98ee..9b9c8f0d1fc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -54,6 +54,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.45.2.746.g06e570c0df.dirty
v2-0007-aio-Add-liburing-dependency.patch
From 309863778a6051b0e18d949551961608dbf9d399 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2 07/20] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
meson.build | 14 ++++
meson_options.txt | 3 +
configure.ac | 11 +++
src/makefiles/meson.build | 3 +
src/include/pg_config.h.in | 3 +
configure | 138 +++++++++++++++++++++++++++++++++++++
src/Makefile.global.in | 4 ++
7 files changed, 176 insertions(+)
diff --git a/meson.build b/meson.build
index e5ce437a5c7..76c276437d7 100644
--- a/meson.build
+++ b/meson.build
@@ -854,6 +854,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3054,6 +3066,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3698,6 +3711,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index 38935196394..6e8d376b3b2 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/configure.ac b/configure.ac
index 247ae97fa4c..dda296ee029 100644
--- a/configure.ac
+++ b/configure.ac
@@ -975,6 +975,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1427,6 +1435,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index aba7411a1be..00613aebc79 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -229,6 +231,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..6ab71a3dffe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/configure b/configure
index 518c33b73a9..1c3fada9fe0 100755
--- a/configure
+++ b/configure
@@ -651,6 +651,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -709,6 +711,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -862,6 +865,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -905,6 +909,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1572,6 +1578,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1618,6 +1625,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8681,6 +8692,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13222,6 +13267,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index eac3d001211..60393ed8fa4 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.45.2.746.g06e570c0df.dirty
v2-0008-aio-Add-io_uring-method.patch
From de57cec96e81a1867a9f1db4c44243cdc0072b20 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Sep 2024 16:15:17 -0400
Subject: [PATCH v2 08/20] aio: Add io_uring method
---
src/include/storage/aio.h | 1 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 386 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 401 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 2e84abfea2d..a1633a0ed3d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -323,6 +323,7 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+ IOMETHOD_IO_URING,
} IoMethod;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f974c4accf5..d2dc1516bdf 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -235,6 +235,9 @@ extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern const IoMethodOps pgaio_sync_ops;
extern const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern const IoMethodOps pgaio_uring_ops;
+#endif
extern const IoMethodOps *pgaio_impl;
extern PgAioCtl *aio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index eabf813ce05..72f928b7602 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index fa2a7e9e5df..3bcb8a0b2ed 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -13,6 +13,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_subject.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index e4c9d439ddd..701f06287d9 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -58,6 +58,9 @@ static PgAioHandle *pgaio_io_from_ref(PgAioHandleRef *ior, uint64 *ref_generatio
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -75,6 +78,9 @@ PgAioPerBackend *my_aio;
static const IoMethodOps *pgaio_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 62738ce1d14..537f23d446d 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_subject.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..3f214e42767
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,386 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO - perform AIO using Linux' io_uring
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_init_backend(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .init_backend = pgaio_uring_init_backend,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *aio_uring_contexts;
+static PgAioUringContext *my_shared_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+AioContextShmemSize(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return AioContextShmemSize();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ aio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &aio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_init_backend(void)
+{
+ int ret;
+
+ my_shared_uring_context = &aio_uring_contexts[MyProcNumber];
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+ int in_flight_before = dclist_count(&my_aio->in_flight_ios);
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ if (!sqe)
+ elog(ERROR, "io_uring submission queue is unexpectedly full");
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+
+ /*
+ * io_uring executes IO in process context if possible. That's
+ * generally good, as it reduces context switching. When performing a
+ * lot of buffered IO that means that copying between page cache and
+ * userspace memory happens in the foreground, as it can't be
+ * offloaded to DMA hardware as is possible when using direct IO. When
+ * executing a lot of buffered IO this causes io_uring to be slower
+ * than worker mode, as worker mode parallelizes the copying.
+ * io_uring can be told to offload work to worker threads instead.
+ *
+ * If an IO is buffered IO and we already have IOs in flight or
+ * multiple IOs are being submitted, we thus tell io_uring to execute
+ * the IO in the background. We don't do so for the first few IOs
+ * being submitted as executing in this process' context has lower
+ * latency.
+ */
+ if (in_flight_before > 4 && (ioh->flags & AHF_BUFFERED))
+ io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
+ in_flight_before++;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+ elog(PANIC, "io_uring_submit failed: %d/%s",
+ ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+ elog(PANIC, "submitted only %d of %d",
+ ret, num_staged_ios);
+ }
+ else
+ {
+ elog(DEBUG3, "submit nios: %d", num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_REAPED 16
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *reaped_cqes[PGAIO_MAX_LOCAL_REAPED];
+ uint32 reaped;
+
+ START_CRIT_SECTION();
+ reaped =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ reaped_cqes,
+ Min(PGAIO_MAX_LOCAL_REAPED, ready));
+ Assert(reaped <= ready);
+
+ ready -= reaped;
+
+ for (int i = 0; i < reaped; i++)
+ {
+ struct io_uring_cqe *cqe = reaped_cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ ereport(DEBUG3,
+ errmsg("drained %d/%d, now expecting %d",
+ reaped, orig_ready, io_uring_cq_ready(&context->io_uring_ring)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &aio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme; nearly all the time the
+ * backend owning the ring will reap the completions, making the locking
+ * unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ ereport(DEBUG3,
+ errmsg("wait_one for io:%d io_gen: %llu, ref_gen: %llu, in state %s, cycle %d",
+ pgaio_io_get_id(ioh),
+ (long long unsigned) ref_generation,
+ (long long unsigned) ioh->generation,
+ pgaio_io_get_state_name(ioh), waited),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != AHS_IN_FLIGHT)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+ elog(PANIC, "io_uring wait failed: %d/%s", ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ ereport(DEBUG3,
+ errmsg("wait_one with %d sleeps",
+ waited),
+ errhidestmt(true),
+ errhidecontext(true));
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITEV:
+ iov = &aio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to prepare invalid IO operation for execution");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc459dc5d2b..4fdcfb1df1b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9b9c8f0d1fc..a5b12b48f99 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2121,6 +2121,7 @@ PgAioReturn
PgAioSubjectData
PgAioSubjectID
PgAioSubjectInfo
+PgAioUringContext
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0009-aio-Add-README.md-explaining-higher-level-design.patch (text/x-diff)
From c95ba2c47ddc454f19703c4361f47690ff8ff05e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 6 Sep 2024 15:27:57 -0400
Subject: [PATCH v2 09/20] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 413 ++++++++++++++++++++++++++++++
src/backend/storage/aio/aio.c | 2 +
2 files changed, 415 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..893f4ffe428
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not have to interact
+directly with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
+ */
+PgAioHandle *ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
+
+/*
+ * Reference that can be used to wait for the IO we initiate below. This
+ * reference can reside in local or shared memory and be waited upon by any
+ * process. An arbitrary number of references can be made for each IO.
+ */
+PgAioRef ior;
+
+pgaio_io_get_ref(ioh, &ior);
+
+/*
+ * Arrange for shared buffer completion callbacks to be called upon completion
+ * of the IO. This callback will update the buffer descriptors associated with
+ * the AioHandle, which e.g. allows other backends to access the buffer.
+ *
+ * Multiple completion callbacks can be registered for each handle.
+ */
+pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
+
+/*
+ * The completion callback needs to know which buffers to update when the IO
+ * completes. As the AIO subsystem does not know about buffers, we have to
+ * associate this information with the AioHandle, for use by the completion
+ * callback registered above.
+ */
+pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
+
+/*
+ * Hand AIO handle to lower-level function. When operating on the level of
+ * buffers, we don't know how exactly the IO is performed, that is the
+ * responsibility of the storage manager implementation.
+ *
+ * E.g. md.c needs to translate block numbers into offsets in segments.
+ *
+ * Once the IO handle has been handed off, it may not be used any further,
+ * as the IO may immediately get executed below smgrstartreadv() and the
+ * handle reused for another IO.
+ */
+smgrstartreadv(ioh, operation->smgr, forknum, blkno,
+ BufferGetBlock(buffer), 1);
+
+/*
+ * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * is however not guaranteed, to allow IO submission to be batched.
+ *
+ * Note that one needs to be careful while there may be unsubmitted IOs, as
+ * another backend may need to wait for one of the unsubmitted IOs. If this
+ * backend were to wait for the other backend, we'd have a deadlock. To avoid
+ * that, pending IOs need to be explicitly submitted before this backend
+ * might be blocked by a backend waiting for IO.
+ *
+ * Note that the IO might have immediately been submitted (e.g. due to reaching
+ * a limit on the number of unsubmitted IOs) and even completed during the
+ * smgrstartreadv() above.
+ *
+ * Once submitted, the IO is in-flight and can complete at any time.
+ */
+pgaio_submit_staged();
+
+/*
+ * To benefit from AIO, one should perform other work, including submitting
+ * further IOs, before waiting for this IO to complete. Otherwise we could
+ * just have used synchronous, blocking IO.
+ */
+perform_other_work();
+
+/*
+ * We did some other work and now need the IO operation to have completed to
+ * continue.
+ */
+pgaio_io_ref_wait(&ior);
+
+/*
+ * At this point the IO has completed. We do not yet know whether it succeeded
+ * or failed, however. The buffer's state has been updated, which allows other
+ * backends to use the buffer (if the IO succeeded), or retry the IO (if it
+ * failed).
+ *
+ * Note that in case the IO has failed, a LOG message may have been emitted,
+ * but no ERROR has been raised. This is crucial, as another backend waiting
+ * for this IO should not see an ERROR.
+ *
+ * To check whether the operation succeeded, and to raise an ERROR (or, if
+ * more appropriate, LOG), the PgAioReturn we passed to pgaio_io_get() is
+ * used.
+ */
+if (ioret.result.status == ARS_ERROR)
+ pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
+
+/*
+ * Besides having succeeded completely, the IO could also have partially
+ * completed. If we e.g. tried to read many blocks at once, the read might have
+ * only succeeded for the first few blocks.
+ *
+ * If the IO partially succeeded and this backend needs all blocks to have
+ * completed, this backend needs to reissue the IO for the remaining buffers.
+ * The AIO subsystem cannot handle this retry transparently.
+ *
+ * As this example is already long, and we only read a single block, we'll just
+ * error out if there's a partial read.
+ */
+if (ioret.result.status == ARS_PARTIAL)
+ pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
+
+/*
+ * The IO succeeded, so we can use the buffer now.
+ */
+```
+
+
+## Design Criteria & Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+  buffered IO is bottlenecked by the operating system having to copy data
+  between the kernel's page cache and postgres' buffer pool using the CPU.
+  Direct IO, in contrast, can often move the data directly between the
+  storage devices and postgres' buffer cache, using DMA. While that transfer
+  is ongoing, the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before we
+  need to wait for them
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+  the number of roundtrips to storage on some OSs and storage HW (buffered IO
+  and direct IO without O_DSYNC need to issue a write and, after the write's
+  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
+  FUA write).
+
+The need to be able to execute IO in critical sections has substantial
+design implications for the AIO subsystem. Mainly because completing IOs
+(see the prior section) needs to be possible within a critical section, even
+if the to-be-completed IO itself was not issued in a critical section.
+Consider e.g. the case of a backend first starting a number of writes from
+shared buffers and then starting to flush the WAL. Because only a limited
+amount of IO can be in progress at the same time, initiating the IO for
+flushing the WAL may require first finishing IO that was issued earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the state of the AIO subsystem needs
+to live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other process
+local state are not necessarily mapped to the same addresses in each process
+due to ASLR. This means that shared memory cannot contain pointers to
+callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows using the AIO API
+while performing synchronous IO. This can be useful for debugging. The code
+for the synchronous mode is also used as a fallback, e.g. by the
+[worker mode](#worker), to execute IO that cannot be executed by workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central API piece of postgres' AIO abstraction are AIO handles. To
+execute an IO one first has to acquire an IO handle (`pgaio_io_get()`) and
+then "define" it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c, md.c to be
+finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#io-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code needs
+to be able to react to IO completion. Shared state can be updated using
+[AIO Completion callbacks](#aio-callbacks)
+and the issuing backend can provide a backend local variable to receive the
+result of the IO, as described in [AIO Results](#aio-results). An IO can be
+waited for, by both the issuing and any other backend, using
+[AIO References](#aio-references).
+
+
+Because an AIO Handle is not executable just after calling `pgaio_io_get()`,
+and because `pgaio_io_get()` needs to be able to succeed, only a single AIO
+Handle may be acquired (i.e. returned by `pgaio_io_get()`) without having
+been defined (by, potentially indirectly, calling `pgaio_io_prep_*()`).
+Otherwise a backend could trivially self-deadlock by using up all AIO
+Handles without the ability to wait for some of the IOs to complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to the completion of an IO. E.g. for
+a read, md.c needs to check if the IO outright failed or was shorter than
+needed, and bufmgr.c needs to verify that the page looks valid and then
+update the BufferDesc to reflect the buffer's new state.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+if the IO operation was successful.
+
+As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleSharedCallbackID`). A substantial added benefit is that this
+allows callbacks to be identified by a much smaller amount of memory (a
+single byte currently).
+
+In addition to completion, AIO callbacks are also called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
+
+As [explained earlier](#io-can-be-started-in-critical-sections) IO completions
+need to be safe to execute in critical sections. To allow the backend that
+issued the IO to error out in case of failure, [AIO Results](#aio-results)
+can be used.
+
+
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle for
+information specific to the subject, and can provide callbacks to reopen the
+underlying file (required for worker mode) and to describe the IO operation
+(used for debug logging and error messages).
+
+
+### AIO References
+
+As [described above](#aio-handles), AIO Handles can be reused immediately
+after completion and therefore cannot themselves be used to wait for the
+completion of an IO. Waiting is instead enabled by AIO references, which
+identify not just an AIO Handle but also the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_ref()` and
+then waited upon using `pgaio_io_ref_wait()`.
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
+
+
+## Helpers
+
+Using the low-level AIO API directly introduces too much complexity to do
+so all over the tree. Most uses of AIO should instead go through reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO are reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+
+The [Read Stream](../../../include/storage/read_stream.h) interface makes it
+comparatively easy to use AIO for such use cases.
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 701f06287d9..2439ce3740d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -24,6 +24,8 @@
* - read_stream.c - helper for accessing buffered relation data with
* look-ahead
*
+ * - README.md - higher-level overview over AIO
+ *
*
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0019-Temporary-Increase-BAS_BULKREAD-size.patch (text/x-diff)
From 75c690243866d3f6b476ecfb9c249da8098122f0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2 19/20] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there's just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dffdd57e9b5..f5795b509c7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,12 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.45.2.746.g06e570c0df.dirty
Attachment: v2-0020-WIP-Use-MAP_POPULATE.patch (text/x-diff)
From e9c132e191cacc9fc946b611afc5f489762c4387 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Dec 2024 13:25:56 -0500
Subject: [PATCH v2 20/20] WIP: Use MAP_POPULATE
For benchmarking it's quite annoying that the first time memory is touched
has completely different perf characteristics than subsequent accesses. Using
MAP_POPULATE reduces that substantially.
---
src/backend/port/sysv_shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index a5a4511f66d..2a45dffd5e0 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -620,7 +620,7 @@ CreateAnonymousSegment(Size *size)
allocsize += hugepagesize - (allocsize % hugepagesize);
ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
- PG_MMAP_FLAGS | mmap_flags, -1, 0);
+ PG_MMAP_FLAGS | MAP_POPULATE | mmap_flags, -1, 0);
mmap_errno = errno;
if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
--
2.45.2.746.g06e570c0df.dirty
Hi,
On 2024-12-19 17:29:12 -0500, Andres Freund wrote:
Not about patch itself, but questions about related stack functionality:

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls requested, to review whether the code
is issuing sensible calls: there's no strace for uring; do you stick to
DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.
After I got this question from Thomas as well, I started hacking one up.
What information would you like to see?
Here's what I currently have:
┌─[ RECORD 1 ]───┬────────────────────────────────────────────────┐
│ pid │ 358212 │
│ io_id │ 2050 │
│ io_generation │ 4209 │
│ state │ COMPLETED_SHARED │
│ operation │ read │
│ offset │ 509083648 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ 262144 │
│ result │ OK │
│ error_desc │ (null) │
│ subject_desc │ blocks 1372864..1372895 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤
│ pid │ 358212 │
│ io_id │ 2051 │
│ io_generation │ 4199 │
│ state │ IN_FLIGHT │
│ operation │ read │
│ offset │ 511967232 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ (null) │
│ result │ UNKNOWN │
│ error_desc │ (null) │
│ subject_desc │ blocks 1373216..1373247 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs
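As a sketch of the shape such per-backend counters could take (all names here are invented for illustration, not taken from the patchset):

```c
#include <stdint.h>

/* Hypothetical counters backing a pg_stat_aio view; all names invented. */
typedef struct PgAioStats
{
    uint64_t    worker_queue_full;  /* submission queue to IO workers was full */
    uint64_t    kernel_submits;     /* syscalls submitting IO to the kernel */
    uint64_t    kernel_waits;       /* syscalls asking the kernel for events */
    uint64_t    inflight_waits;     /* waited for in-flight IOs before issuing more */
} PgAioStats;

/*
 * With io_uring one submission syscall can cover many IOs, which is why the
 * counter is incremented once per syscall, not once per IO (hence the
 * "<= #ios with io_uring" notes above).
 */
static void
pgaio_stats_count_submit(PgAioStats *stats, int nios)
{
    (void) nios;                /* per-IO accounting could be added here */
    stats->kernel_submits++;
}
```

Fixed-width counters like these could live in shared memory per backend and be summed by the view, similar to other cumulative statistics.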
Greetings,
Andres Freund
Patches 1 and 2 are still Ready for Committer.
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
That's a helpful addition. I've left inline comments on it, below.
The biggest TODOs are:
- Right now the API between bufmgr.c and read_stream.c kind of necessitates
  that one StartReadBuffers() call actually can trigger multiple IOs, if
  one of the buffers was read in by another backend, before "this" backend
  called StartBufferIO().

I think Thomas and I figured out a way to evolve the interface so that this
isn't necessary anymore: We allow StartReadBuffers() to memorize buffers it
pinned but didn't initiate IO on in the buffers[] argument. The next call to
StartReadBuffers() then doesn't have to repin these buffers. That doesn't
just solve the multiple-IOs-for-one-"read operation" issue, it also makes the
- very common - case of a bunch of "buffer misses" followed by a "buffer hit"
cleaner: the hit wouldn't be tracked in the same ReadBuffersOperation
anymore.
That sounds reasonable.
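As a toy model of the memorization scheme described above (all names invented; the real interface is bufmgr.c's StartReadBuffers()/ReadBuffersOperation, which this only mimics):

```c
#include <stdbool.h>

#define TOY_MAX_READ_BUFFERS 16

/* Invented stand-in for ReadBuffersOperation, purely for illustration. */
typedef struct ToyReadOp
{
    bool        pinned[TOY_MAX_READ_BUFFERS];   /* pins memorized across calls */
    int         nblocks;
} ToyReadOp;

/*
 * Pin the requested buffers, skipping the ones a previous call already
 * pinned but did not start IO on. Returns the number of new pins taken,
 * i.e. the work a follow-up call for the same range can avoid.
 */
static int
toy_start_read_buffers(ToyReadOp *op)
{
    int         newly_pinned = 0;

    for (int i = 0; i < op->nblocks; i++)
    {
        if (op->pinned[i])
            continue;           /* memorized pin, no need to repin */
        op->pinned[i] = true;
        newly_pinned++;
    }
    return newly_pinned;
}
```

The point of the pattern is only that the operation carries enough state for a second call to be idempotent with respect to pinning; the actual IO-initiation logic is omitted.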
- Right now bufmgr.h includes aio.h, because it needs to include a reference
to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
would be noticeable overhead, so that's not an option. I think the best
option here would be to introduce something like aio_types.h, so fewer
things are included.
That sounds fine. Header splits aren't going to be perfect, so I'd pick
something (e.g. your proposal here) and move on.
- There's no obvious way to tell "internal" functions operating on an IO
  handle apart from functions that are expected to be called by the issuer
  of an IO.

One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea; it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
That's reasonable, albeit non-critical.
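A cheap way to get that distinction is to wrap the general reference type in a single-member struct, so that issuer-only functions take a type that other code cannot supply by accident. A sketch with invented names:

```c
#include <stdint.h>

/* A reference any backend may use to wait on an IO (invented stand-in). */
typedef struct ToyAioRef
{
    int         idx;            /* index of the AIO handle */
    uint64_t    generation;     /* guards against handle reuse */
} ToyAioRef;

/*
 * A reference that only the issuing backend may hold, valid until the IO is
 * submitted. Wrapping the general reference in a distinct struct (instead of
 * using a typedef) keeps the two types incompatible, so passing a plain
 * ToyAioRef to an issuer-only function fails to compile.
 */
typedef struct ToyAioIssuerRef
{
    ToyAioRef   ref;
} ToyAioIssuerRef;

/* Example of a function restricted to the issuer before submission. */
static uint64_t
toy_issuer_only_generation(ToyAioIssuerRef iref)
{
    return iref.ref.generation;
}
```

The wrapper costs nothing at runtime; it only moves the "who may call this" rule from comments into the type system.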
- The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
improved
POSIX uses the word "result" for the consequences of a function (e.g. the
result of unlink() is readdir() no longer finding the link). It uses the word
"return" for a memory value that describes a result. In that usage, the
struct currently called PgAioResult would be a Return. The struct currently
called PgAioReturn is PgAioResult plus the data to identify the IO. Possible
name changes:
PgAioResult -> PgAioReturn
PgAioReturn -> PgAioReturnIdentified | PgAioReturnID | PgAioReturnTagged [I don't love these]
PgAioResultStatus -> PgAioStatus | PgAioFill
That said, I don't dislike the existing names and would not have raised the
topic myself.
- The debug logging functions are a bit of a mess, lots of very similar code
in lots of places. I think AIO needs a few ereport() wrappers to make this
easier.
May as well.
- More tests are needed. None of our current test frameworks really makes this
easy :(.
Which testing gap do you find most concerning? I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios. Works for me.
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
  * aio.c
  *   AIO - Core Logic
  *
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ *   subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ *   look-ahead
+ *
I felt like some list entries in this new header comment largely restated the
file name. Here's how I'd write them to avoid that:
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example. I might also move "### Why Direct / unbuffered IO"
to here. For me as a reader, I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
- usage example as it is, with full comments
- the rest
In other words, like this:
# Asynchronous & Direct IO
## Motivation
### Why Asynchronous IO
[existing content moved from lower in the file]
## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
## I/O Operation States & Transitions
[PgAioHandleState and its transitions]
## AIO Usage Example
[your content:]
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].
+ * Once the IO handle has been handed of, it may not further be used, as the
s/of/off/
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.
The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".
+ASLR. This means that the shared memory cannot contain pointer to callbacks.
s/pointer/pointers/
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?
+### AIO Subjects
+
+In addition to the completion callbacks describe above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
Can this say roughly how to decide when to add a new subject? Failing that,
can it give examples of what additional subjects might exist if certain
existing subsystems were to start using AIO?
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow to react to failing IOs the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?
Thanks,
nm
Hi,
On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
Patches 1 and 2 are still Ready for Committer.
I feel somewhat weird about pushing 0002 without a user, but I guess it's
still exercised, so it's probably fine...
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
That's a helpful addition. I've left inline comments on it, below.
Cool!
- More tests are needed. None of our current test frameworks really makes this
easy :(.
Which testing gap do you find most concerning?
Most of it isn't even AIO specific...
- temporary tables are rather poorly tested in general:
- e.g. trivial to exceed the number of buffers, but our tests don't reach that
- We have pretty much no testing for IO errors. We have a bit of coverage due to
src/bin/pg_amcheck/t/003_check.pl, but that's for errors originating in
bufmgr.c itself.
- no real testing of StartBufferIO's etc wait paths
- no testing for BM_PIN_COUNT_WAITER
I e.g. just noticed that the error handling for AIO on temp tables was broken
- but our tests never reach that:
The bug exists due to temp tables not differentiating between "backend" pins
and a "global pincount" - which means that there's no real way for the AIO
subsystem to have a reference separate from the backend local pin -
CheckForLocalBufferLeaks() complains about any leftover pins. It seems to
work in non-assert mode, but with assertions transaction abort asserts out.
I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com
That's a good one, yea.
I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.
And then we'll need more injection points, I'm afraid :(.
- Several folks asked for pg_stat_aio to come back, in "v1" that showed the
set of currently in-flight AIOs. That's not particularly hard - except
that it doesn't really fit in the pg_stat_* namespace.
Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios. Works for me.
Cool.
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
  * aio.c
  * AIO - Core Logic
  *
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - aio.c - core AIO state handling
+ *
+ * - aio_init.c - initialization
+ *
+ * - aio_io.c - dealing with actual IO, including executing IOs synchronously
+ *
+ * - aio_subject.c - functionality related to executing IO for different
+ *   subjects
+ *
+ * - method_*.c - different ways of executing AIO
+ *
+ * - read_stream.c - helper for accessing buffered relation data with
+ *   look-ahead
+ *
I felt like some list entries in this new header comment largely restated the
file name. Here's how I'd write them to avoid that:
Thanks, adopting.
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization
I don't particularly like "per-startup-process", because "global
initialization" really is separate from (and precedes) the startup process's
startup. Maybe "per-server and per-backend initialization"?
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data
Did the order you listed the files have a system to it? If so, what is it?
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,413 @@
+# Asynchronous & Direct IO
I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example.
I moved the example back and forth when writing because different readers
would benefit from a different order and I couldn't quite decide.
So I'm happy to adjust based on your feedback...
I might also move "### Why Direct / unbuffered IO" to here. For me as a
reader, I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
Hm - why have PgAioHandleState and its states before the usage example? Seems
like it'd be harder to understand that way.
- usage example as it is, with full comments
- the rest
## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);
Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire AIO Handle, ioret will get result upon completion.
Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].
Hm. Strictly speaking it's fine to block here, depending on whether
StartBufferIO() was already called. I'll clarify.
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.
The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".
It is indeed awkward. I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:
... Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in-progress at the same time, initiating IO for flushing the WAL may
require to first complete IO that was started earlier.
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.
Reading this, it's not obvious to me how to reconcile "finishing an IO could"
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?
Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.
How about:
In addition to completion, AIO callbacks also are called to "prepare" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.
+### AIO Subjects
+
+In addition to the completion callbacks describe above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
Can this say roughly how to decide when to add a new subject?
Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.
E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).
The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implementations all could share the same subject implementation, even if
callbacks would differ between these use cases.
How about:
I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
subject. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
and it would not make sense to use the same subject for smgr and WAL.
Failing that, can it give examples of what additional subjects might exist
if certain existing subsystems were to start using AIO?
I think the main ones I can think of are:
1) WAL logging
This was implemented in v1. I'd guess that "real" WAL logging and
initializing new WAL segments might use a different subject, but that's
probably a question of taste.
2) "raw" file IO, for things that don't use the smgr abstraction. I could
e.g. imagine using AIO in COPY to read / write the FROM/TO file or to
implement CREATE DATABASE ... STRATEGY file_copy with AIO.
This was used in v1, e.g. to implement the initial data directory sync
after a crash. We do that on a filesystem level, not going through smgr
etc.
3) FE/BE network IO
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow to react to failing IOs the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_log()`, with the error details encoded in
+`PgAioResult`).
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?
I agree this should be explained somewhere - but not sure this is the best
place.
The reason it's ok is that each backend has a limited number of AIO handles;
if it runs out of IO handles we'll a) check if any IOs can be reclaimed and
b) wait for the oldest IO to finish.
Thanks for the review!
Andres Freund
On 01/01/2025 06:03, Andres Freund wrote:
Hi,
Attached is a new version of the AIO patchset.
I haven't gone through it all yet, but some comments below.
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
Thanks, the README is super helpful! I was overwhelmed by all the new
concepts before, now it all makes much more sense.
Now that it's all laid out more clearly, I see how many different
concepts and states there really are:
- For a single IO, there is an "IO handle", "IO references", and an "IO
return". You first allocate an IO handle (PgAioHandle), and then you get
a reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
- An IO handle has eight different states (PgAioHandleState).
I'm sure all those concepts exist for a reason. But still I wonder: can
we simplify?
pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
pgaio_io_acquire() or similar. "get" also feels very innocent, even
though it may wait for previous IO to finish. Especially when
pgaio_io_get_ref() actually is innocent.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;
Do we need to distinguish between DEFINED and PREPARED? At quick glance,
those states are treated the same. (The comment refers to
pgaio_io_start_*() functions, but there's no such thing)
I didn't quite understand the point of the prepare callbacks. For
example, when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it
need to be in a callback? I assume it's somehow related to error
handling, but I didn't quite get it. Perhaps an "abort" callback that'd
be called on error, instead of a "prepare" callback, would be better?
There are some synonyms used in the code: I think "in-flight" and
"submitted" mean the same thing. And "prepared" and "staged". I'd
suggest picking just one term for each concept.
I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states.
Does a single IO go through both states, or are they mutually exclusive?
At quick glance, I don't actually see any code that would set the
COMPLETED_LOCAL state; is it dead code?
REAPED feels like a bad name. It sounds like a later stage than
COMPLETED, but it's actually vice versa.
I'm a little surprised that the term "IO request" isn't used anywhere. I
have no concrete suggestion, but perhaps that would be a useful term.
- Retries for partial IOs (i.e. short reads) are now implemented. Turned out
to take all of three lines and adding one missing variable initialization.
:-)
- There's no obvious way to tell "internal" function operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.
One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.
This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.
The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).
Hmm, yeah I think you might be onto something here.
Could pgaio_io_get() return an PgAioHandleRef directly, so that the
issuer would never see a raw PgAioHandle ?
Finally, attached are a couple of typos and other trivial suggestions.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
aio-typos.patch (text/x-patch)
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 0076ea4aa10..db3257c2705 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -15,7 +15,7 @@ In this example, a buffer will be read into shared buffers.
PgAioReturn ioret;
/*
- * Acquire AIO Handle, ioret will get result upon completion.
+ * Acquire an AIO Handle, ioret will get the result upon completion.
*/
PgAioHandle *ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
@@ -46,15 +46,15 @@ pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
/*
- * Hand AIO handle to lower-level function. When operating on the level of
+ * Pass the AIO handle to lower-level function. When operating on the level of
* buffers, we don't know how exactly the IO is performed, that is the
* responsibility of the storage manager implementation.
*
* E.g. md.c needs to translate block numbers into offsets in segments.
*
- * Once the IO handle has been handed of, it may not further be used, as the
- * IO may immediately get executed below smgrstartreadv() and the handle reused
- * for another IO.
+ * Once the IO handle has been handed off to smgrstartreadv(), it may not
+ * further be used, as the IO may immediately get executed in smgrstartreadv()
+ * and the handle reused for another IO.
*/
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
@@ -167,7 +167,7 @@ The main reason *not* to use Direct IO are:
explicit prefetching.
- In situations where shared_buffers cannot be set appropriately large,
e.g. because there are many different postgres instances hosted on shared
- hardware, performance will often be worse then when using buffered IO.
+ hardware, performance will often be worse than when using buffered IO.
### Deadlock and Starvation Dangers due to AIO
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 261a752fb80..1cef6ef556b 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -123,10 +123,10 @@ static PgAioHandle *inj_cur_handle;
*
* If a handle was acquired but then does not turn out to be needed,
* e.g. because pgaio_io_get() is called before starting an IO in a critical
- * section, the handle needs to be be released with pgaio_io_release().
+ * section, the handle needs to be released with pgaio_io_release().
*
*
- * To react to the completion of the IO as soon as it is know to have
+ * To react to the completion of the IO as soon as it is known to have
* completed, callbacks can be registered with pgaio_io_add_shared_cb().
*
* To actually execute IO using the returned handle, the pgaio_io_prep_*()
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
index 3c255775833..9e111c04b7e 100644
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -31,7 +31,7 @@ static void pgaio_io_before_prep(PgAioHandle *ioh);
/* --------------------------------------------------------------------------------
* "Preparation" routines for individual IO types
*
- * These are called by place the place actually initiating an IO, to associate
+ * These are called by XXX place the place actually initiating an IO, to associate
* the IO specific data with an AIO handle.
*
* Each of the preparation routines first needs to call
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f4c57438dd4..7a81e211d48 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -38,12 +38,13 @@ typedef enum PgAioHandleState
AHS_HANDED_OUT,
/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+ /* XXX: there are no pgaio_io_start_*() functions */
AHS_DEFINED,
- /* subjects prepare() callback has been called */
+ /* subject's prepare() callback has been called */
AHS_PREPARED,
- /* IO is being executed */
+ /* IO has been submitted and is being executed */
AHS_IN_FLIGHT,
/* IO finished, but result has not yet been processed */
On LWLockDisown():
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the callers responsibility to ensure that
+ * the lock gets released, even in case of an error. This only is desirable if
+ * the lock is going to be released in a different process than the process
+ * that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.
Returning the lock mode feels a bit ad hoc..
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.
Hmm. I won't insist, but I feel it probably would be worth it. This is
only in LOCK_DEBUG mode so there's no performance penalty in non-debug
builds, and when you do compile with LOCK_DEBUG you probably appreciate
any extra information.
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * of the caller.
+ */
That feels weird. The only caller outside lwlock.c does call
RESUME_INTERRUPTS() immediately.
Perhaps it'd make for a better external interface if LWLockDisown() did
call RESUME_INTERRUPTS(), and there was a separate internal version that
didn't. And it might make more sense for the external version to return
'void' while we're at it. Returning a value that the caller ignores is
harmless, of course, but it feels a bit weird. It makes you wonder what
you're supposed to do with it.
+	{
+		{"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Selects the method of asynchronous I/O to use."),
+			NULL
+		},
+		&io_method,
+		DEFAULT_IO_METHOD, io_method_options,
+		NULL, assign_io_method, NULL
+	},
+
The description is a bit funny because synchronous I/O is one of the
possible methods.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 2025-01-07 17:09:58 +0200, Heikki Linnakangas wrote:
On 01/01/2025 06:03, Andres Freund wrote:
Hi,
Attached is a new version of the AIO patchset.
I haven't gone through it all yet, but some comments below.
Thanks!
The biggest changes are:
- The README has been extended with an overview of the API. I think it gives a
good overview of how the API fits together. It'd be very good to get
feedback from folks that aren't as familiar with AIO, I can't really see
what's easy/hard anymore.
Thanks, the README is super helpful! I was overwhelmed by all the new
concepts before, now it all makes much more sense.
Now that it's all laid out more clearly, I see how many different concepts
and states there really are:
- For a single IO, there is an "IO handle", "IO references", and an "IO
return". You first allocate an IO handle (PgAioHandle), and then you get a
reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
- An IO handle has eight different states (PgAioHandleState).
I'm sure all those concepts exist for a reason. But still I wonder: can we
simplify?
Probably, but it's not exactly obvious to me where.
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.
Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress",
i.e. guarantee that pgaio_io_get() will return something without blocking
forever.
pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
pgaio_io_acquire() or similar. "get" also feels very innocent, even though
it may wait for previous IO to finish. Especially when pgaio_io_get_ref()
actually is innocent.
WFM.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;
Do we need to distinguish between DEFINED and PREPARED?
I found it to be rather confusing if it's not possible to tell if some action
(like the prepare callback) has already happened, or not. It's useful to be
able look at an IO in a backtrace or such and see exactly in what state it is
in.
In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.
At quick glance, those states are treated the same. (The comment refers to
pgaio_io_start_*() functions, but there's no such thing)
They're called pgaio_io_prep_{readv,writev} now, updated the comment.
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?
One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subsystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.
I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?
I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).
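The rule - the AIO subsystem claims the resources if and only if the IO got staged - can be modelled with a tiny refcount sketch (purely illustrative names; this is not bufmgr.c/aio.c code):

```c
#include <assert.h>

/* Toy model of a pinned buffer and an IO that may reference it. */
typedef struct SketchBuffer
{
    int refcount;               /* issuer's pin plus any IO references */
} SketchBuffer;

typedef struct SketchIo
{
    SketchBuffer *buf;
    int staged;
} SketchIo;

/* Staging is the point where the AIO subsystem takes its own reference. */
void
sketch_stage_io(SketchIo *io, SketchBuffer *buf)
{
    io->buf = buf;
    buf->refcount++;
    io->staged = 1;
}

/*
 * Issuer errors out and releases its own pin. Returns the references that
 * remain: non-zero iff the IO was staged, so there is neither a leak (error
 * before staging) nor a dangling buffer (error after staging).
 */
int
sketch_issuer_error(SketchBuffer *buf)
{
    buf->refcount--;
    return buf->refcount;
}
```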
There are some synonyms used in the code: I think "in-flight" and
"submitted" mean the same thing.
Fair. I guess in my mind the process of moving an IO into flight is
"submitting" and the state of having been submitted but not yet having
completed is being in flight. But that's probably not useful.
And "prepared" and "staged". I'd suggest picking just one term for each
concept.
Agreed.
I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states. Does a
single IO go through both states, or are they mutually exclusive? At quick
glance, I don't actually see any code that would set the COMPLETED_LOCAL
state; is it dead code?
It's dead code right now. I've made it dead and undead a couple times
:/. Unfortunately I think I need to revive it to make some corner cases with
temporary tables work (AIO for temp table is executed via IO uring, another
backend waits for *another* IO executed via that IO uring instance and reaps
the completion -> we can't update the local buffer state in the shared
completion callback).
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.
What would you call having gotten "completion notifications" from the kernel,
but not having processed them?
- There's no obvious way to tell "internal" functions operating on an IO handle
apart from functions that are expected to be called by the issuer of an IO.

One way to deal with this would be to introduce a distinct "issuer IO
reference" type. I think that might be a good idea, it would also make it
clearer that a good number of the functions can only be called by the
issuer, before the IO is submitted.

This would also make it easier to order functions more sensibly in aio.c, as
all the issuer functions would be together.

The functions on AIO handles that everyone can call already have a distinct
type (PgAioHandleRef vs PgAioHandle*).

Hmm, yeah I think you might be onto something here.
I'll give it a try.
Could pgaio_io_get() return a PgAioHandleRef directly, so that the issuer
would never see a raw PgAioHandle ?
Don't think that would be helpful - that way there'd be no difference at all
anymore between what functions any backend can call and what the issuer can
do.
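A hypothetical sketch of such a type-level split (all names invented for this example; not what the patchset defines):

```c
#include <assert.h>

/* The shared-memory handle itself. */
typedef struct SketchAioHandle
{
    int generation;     /* bumped whenever the handle is recycled */
    int data;
} SketchAioHandle;

/* What any backend may hold: validated against the generation. */
typedef struct SketchAioHandleRef
{
    SketchAioHandle *ioh;
    int generation;
} SketchAioHandleRef;

/* What only the issuer gets: grants access to issuer-only operations. */
typedef struct SketchAioIssuerRef
{
    SketchAioHandle *ioh;
} SketchAioIssuerRef;

/* Issuer-only: the parameter type alone documents who may call this. */
void
sketch_set_data(SketchAioIssuerRef iref, int data)
{
    iref.ioh->data = data;
}

/* Callable by anyone holding a ref, even after the handle was recycled. */
int
sketch_ref_still_valid(SketchAioHandleRef ref)
{
    return ref.ioh->generation == ref.generation;
}
```

The compiler then rejects passing an ordinary ref to an issuer-only function, which is the separation being discussed.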
Finally, attached are a couple of typos and other trivial suggestions.
Integrating...
Thanks!
Andres
Hi,
On 2025-01-07 18:08:51 +0200, Heikki Linnakangas wrote:
On LWLockDisown():
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure that
+ * the lock gets released, even in case of an error. This only is desirable if
+ * the lock is going to be released in a different process than the process
+ * that acquired it.
+ *
+ * Returns the mode in which the lock was held by the current backend.

Returning the lock mode feels a bit ad hoc.
It seemed useful to me, that way callers could verify that the released lock
level is actually what it expected. What do we gain by hiding this information
anyway?
Orthogonal: I think it was a mistake that LWLockRelease() didn't require the
to-be-released lock mode to be passed in...
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). We could add a separate flag indicating that, but it
+ * doesn't really seem worth it.

Hmm. I won't insist, but I feel it probably would be worth it. This is only
in LOCK_DEBUG mode so there's no performance penalty in non-debug builds,
and when you do compile with LOCK_DEBUG you probably appreciate any extra
information.
I actually thought it'd be more useful if it stays pointing to the 'original
owner'.
When you say "it" would be worth it, you mean resetting owner, or adding a
flag indicating that it's a disowned lock?
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ */

That feels weird. The only caller outside lwlock.c does call
RESUME_INTERRUPTS() immediately.
Yea, I didn't feel happy with it either. It just seemed that the cure (a
separate function, or a parameter indicating whether interrupts should be
resumed) was as bad as the disease.
Perhaps it'd make for a better external interface if LWLockDisown() did call
RESUME_INTERRUPTS(), and there was a separate internal version that didn't.
Hm, that seems more complicated than it's worth. I'd either leave it as-is,
or add a parameter to LWLockDisown to indicate if interrupts should be
resumed.
And it might make more sense for the external version to return 'void' while
we're at it. Returning a value that the caller ignores is harmless, of
course, but it feels a bit weird. It makes you wonder what you're supposed
to do with it.
This one I disagree with, I think it makes a lot of sense to return the lock
mode of the lock you just disowned.
Doubtful it matters, but the compiler can trivially optimize that out for the
lwlock.c callers.
+ {
+     {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+         gettext_noop("Selects the method of asynchronous I/O to use."),
+         NULL
+     },
+     &io_method,
+     DEFAULT_IO_METHOD, io_method_options,
+     NULL, assign_io_method, NULL
+ },
+

The description is a bit funny because synchronous I/O is one of the
possible methods.
Hah. How about:
"Selects the method of, potentially asynchronous, IO execution."?
Greetings,
Andres Freund
On Mon, Jan 06, 2025 at 04:40:26PM -0500, Andres Freund wrote:
On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
- We have pretty much no testing for IO errors.
Yes, that's remained a gap. I've wondered how much to address this via
targeted tests of specific sites vs. fuzzing, iterative fault injection, or
some other approach closer to brute force.
I'd be most interested in the
cases that would be undetected deadlocks under a naive design. An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

That's a good one, yea.

I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.

And then we'll need more injection points, I'm afraid :(.
Sounds good.
* - method_*.c - different ways of executing AIO (e.g. worker process)
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
* - aio_subject.c - callbacks at IO operation lifecycle events
* - aio_init.c - per-fork and per-startup-process initialization

I don't particularly like "per-startup-process", because "global
initialization" really is separate from (and precedes) startup process
startup. Maybe "per-server and per-backend initialization"?
That works for me. I wrote "per-startup-process" because it can happen more
than once in a postmaster that reaches "all server processes terminated;
reinitializing". That said, there's little risk of "per-server" giving folks
a materially wrong idea.
* - aio.c - all other topics
* - read_stream.c - helper for reading buffered relation data

Did the order you listed the files have a system to it? If so, what is it?
The rough idea was to avoid forward references:
* - method_*.c - different ways of executing AIO (e.g. worker process)
makes sense without other background
* - aio_io.c - method-independent code for specific IO ops (e.g. readv)
refers to methods, so listed after methods
* - aio_subject.c - callbacks at IO operation lifecycle events
refers to IO ops, so listed after aio_io.c
* - aio_init.c - per-fork and per-startup-process initialization
no surprise that this code will exist somewhere, so list it lower to deemphasize it
* - aio.c - all other topics
default route, hence last
* - read_stream.c - helper for reading buffered relation data
could just as easily come first, not last
could be under a distinct heading like "Recommended abstractions:"
I'd benefit from seeing things in this order:
- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions

Hm - why have PgAioHandleState and its states before the usage example? Seems
like it'd be harder to understand that way.
I usually look at the data structures before the code that manipulates them.
(Similarly, I look at the map before the directions.) I wouldn't mind it
appearing after the usage example, since order preferences do vary.
- usage example as it is, with full comments
- the rest

## Synopsis
ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
    pgaio_result_log(ioret.result, &ioret.subject_data, ERROR);

Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...
I'm not sure either. Let's drop that idea.
+### IO can be started in critical sections
...
+The need to be able to execute IO in critical sections has substantial design
+implications on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating the IO for flushing the WAL
+may require to first finish executing IO executed earlier.

The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms. Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".

It is indeed awkward. I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:

... Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in-progress at the same time, initiating IO for flushing the WAL may
require to first complete IO that was started earlier.
That's non-awkward. I like specific state names here since "complete" could
mean AHS_COMPLETED_SHARED or AHS_COMPLETED_LOCAL, and it matters here. If the
state names changed so AHS_COMPLETED_LOCAL dropped the word "complete", that
too would solve it.
+### AIO Callbacks
...
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
+IO to/from shared buffers, which is required to handle the case where the
+issuing backend errors out and releases its own pins.

Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section". Pinning a buffer in a critical section sounds bad. I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten. Can this text have a sentence making that explicit?

Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.

How about:
In addition to completion, AIO callbacks also are called to "prepare" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.
Perfect.
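A toy model of why that is safe in a critical section (illustrative, not bufmgr.c): the buffer is already pinned by the issuer, so accounting for the AIO subsystem's reference is a plain increment that cannot fail, unlike acquiring a fresh pin.

```c
#include <assert.h>

typedef struct SketchBufDesc
{
    unsigned refcount;
} SketchBufDesc;

/*
 * Called from the prepare step: the issuer already holds a pin, so this
 * cannot allocate, look anything up, or error out - it just bumps the
 * existing reference count on behalf of the AIO subsystem.
 */
void
sketch_aio_account_pin(SketchBufDesc *buf)
{
    assert(buf->refcount > 0);  /* a pin must already exist */
    buf->refcount++;
}
```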
+### AIO Subjects
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "subject". Each subject has some space inside an AIO Handle with
+information specific to the subject and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).

Can this say roughly how to decide when to add a new subject?

Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.

E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).

The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implementations all could share the same subject implementation, even if
callbacks would differ between these use cases.

How about:
I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
subject. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
and it would not make sense to use the same subject for smgr and WAL.
Sounds good to include.
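A hypothetical sketch of that rule as data (type and field names invented for this example; the WAL subject in particular is only a thought experiment here):

```c
#include <assert.h>

/* One subject per way of identifying the file being operated on. */
typedef enum SketchAioSubject
{
    SUBJ_SMGR,      /* shared by all smgr implementations and smgr bypass IO */
    SUBJ_WAL,       /* hypothetical: WAL identifies IO differently */
} SketchAioSubject;

typedef struct SketchSubjectData
{
    SketchAioSubject subject;
    union
    {
        struct
        {
            unsigned relnumber;     /* stand-in for RelFileLocator */
            int forknum;
            unsigned blocknum;
        } smgr;
        struct
        {
            unsigned tli;           /* TimeLineID */
            unsigned long long lsn; /* XLogRecPtr */
        } wal;
    } u;
} SketchSubjectData;

/* e.g. a describe/reopen callback would dispatch on the subject tag */
unsigned
sketch_subject_block(const SketchSubjectData *sd)
{
    return sd->subject == SUBJ_SMGR ? sd->u.smgr.blocknum : 0;
}
```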
Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness? In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?

I agree this should be explained somewhere - but not sure this is the best
place.

The reason it's ok is that each backend has a limited number of AIO handles
and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
wait for the oldest IO to finish.
Reading it again today, that topic may already have adequate coverage.
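The forward-progress guarantee could be sketched as follows (toy model; in the real system the final step waits for the oldest in-flight IO rather than failing):

```c
#include <assert.h>

#define SKETCH_NHANDLES 4

typedef enum { H_IDLE, H_IN_FLIGHT, H_COMPLETED } SketchHandleState;

SketchHandleState sketch_handles[SKETCH_NHANDLES];

/*
 * Acquire a handle: prefer an idle one, else reclaim a completed one.
 * Returns the handle index, or -1 when every handle is in flight (where
 * the real code would block on the oldest IO instead of giving up).
 */
int
sketch_acquire_handle(void)
{
    for (int i = 0; i < SKETCH_NHANDLES; i++)
        if (sketch_handles[i] == H_IDLE)
            return i;

    for (int i = 0; i < SKETCH_NHANDLES; i++)
        if (sketch_handles[i] == H_COMPLETED)
        {
            sketch_handles[i] = H_IDLE;     /* reclaim */
            return i;
        }

    return -1;
}
```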
On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.
To me, those names don't convey that. I would perhaps call the thing
that supports issuer-only operations a "PgAio" and the thing other
people can use a "PgAioHandle". Or "PgAioRequest" and "PgAioHandle" or
something like that. With PgAioHandleRef, IMHO you've got two words
that both imply a layer of indirection -- "handle" and "ref" -- which
doesn't seem quite as nice, because then the other thing --
"PgAioHandle" still sort of implies
one layer of indirection and the whole thing seems a bit less clear.
(I say all of this having looked at nothing, so feel free to ignore me
if that doesn't sound coherent.)
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?
The Linux kernel calls those zombie processes, so we could call it a
ZOMBIE state, but that seems like it might be a bit of inside
baseball. I do agree with Heikki that REAPED sounds later than
COMPLETED, because you reap zombie processes by collecting their exit
status. Maybe you could have AHS_COMPLETE or AHS_IO_COMPLETE for the
state where the I/O is done but there's still completion-related work
to be done, and then the other state could be AHS_DONE or AHS_FINISHED
or AHS_FINAL or AHS_REAPED or something.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 07/01/2025 18:11, Andres Freund wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress",
i.e. guarantee that pgaio_io_get() will return something without blocking
forever.
Right, yeah, I can see that.
typedef enum PgAioHandleState
{
/* not in use */
AHS_IDLE = 0,

/* returned by pgaio_io_get() */
AHS_HANDED_OUT,

/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
AHS_DEFINED,

/* subject's prepare() callback has been called */
AHS_PREPARED,

/* IO has been submitted and is being executed */
AHS_IN_FLIGHT,

/* IO finished, but result has not yet been processed */
AHS_REAPED,

/* IO completed, shared completion has been called */
AHS_COMPLETED_SHARED,

/* IO completed, local completion has been called */
AHS_COMPLETED_LOCAL,
} PgAioHandleState;

Do we need to distinguish between DEFINED and PREPARED?
I found it to be rather confusing if it's not possible to tell if some action
(like the prepare callback) has already happened, or not. It's useful to be
able to look at an IO in a backtrace or such and see exactly what state it is
in.
I see.
In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.
Agreed on that
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subsystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.

I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).
Hmm. The comments say that when you call smgrstartreadv(), the IO handle
may no longer be modified, as the IO may be executed immediately. What
if we changed that so that it never submits the IO, only adds the
necessary callbacks to it?
In that world, when smgrstartreadv() returns, the necessary details and
completion callbacks have been set in the IO handle, but the caller can
still do more preparation before the IO is submitted. The caller must
ensure that it gets submitted, however, so no erroring out in that state.
Currently the call stack looks like this:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
-> shared_buffer_readv_prepare() (callback)
<- (return)
<- (return)
<- (return)
<- (return)
<- (return)
I'm thinking that the prepare work is done "on the way up" instead:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
<- (return)
<- (return)
<- (return)
-> shared_buffer_readv_prepare()
<- (return)
Attached is a patch to demonstrate concretely what I mean.
This adds a new pgaio_io_stage() step to the issuer, and the issuer
needs to call the prepare functions explicitly, instead of having them
as callbacks. Nominally that's more steps, but IMHO it's better to be
explicit. The same actions were happening previously too, it was just
hidden in the callback. I updated the README to show that too.
I'm not wedded to this, but it feels a little better to me.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
aio-remove-prepare-callback.patch (text/x-patch)
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 0076ea4aa10..25b5f5d9529 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -60,7 +60,18 @@ smgrstartreadv(ioh, operation->smgr, forknum, blkno,
BufferGetBlock(buffer), 1);
/*
- * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * After smgrstartreadv() has returned, we are committed to performing the IO.
+ * We may do more preparation or add more callbacks to the IO, but must
+ * *not* error out before calling pgaio_io_stage(). We don't have any such
+ * preparation to do here, so just call pgaio_io_stage() to indicate that we
+ * have completed building the IO request. It usually queues up the request
+ * for batching, but may submit it immediately if the batch is full or if
+ * the request needed to be processed synchronously.
+ */
+pgaio_io_stage(ioh);
+
+/*
+ * The IO might already have been initiated by pgaio_io_stage(). That
* is however not guaranteed, to allow IO submission to be batched.
*
* Note that one needs to be careful while there may be unsubmitted IOs, as
@@ -69,10 +80,6 @@ smgrstartreadv(ioh, operation->smgr, forknum, blkno,
* that, pending IOs need to be explicitly submitted before this backend
* might be blocked by a backend waiting for IO.
*
- * Note that the IO might have immediately been submitted (e.g. due to reaching
- * a limit on the number of unsubmitted IOs) and even completed during the
- * smgrstartreadv() above.
- *
* Once submitted, the IO is in-flight and can complete at any time.
*/
pgaio_submit_staged();
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 261a752fb80..ed03fe03609 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -110,7 +110,7 @@ static PgAioHandle *inj_cur_handle;
* Acquire an AioHandle, waiting for IO completion if necessary.
*
* Each backend can only have one AIO handle that that has been "handed out"
- * to code, but not yet submitted or released. This restriction is necessary
+ * to code, but not yet staged or released. This restriction is necessary
* to ensure that it is possible for code to wait for an unused handle by
* waiting for in-flight IO to complete. There is a limited number of handles
* in each backend, if multiple handles could be handed out without being
@@ -249,6 +249,43 @@ pgaio_io_release(PgAioHandle *ioh)
}
}
+/*
+ * Finish building an IO request. Once a request has been staged, there's no
+ * going back; the IO subsystem will attempt to perform the IO. If the IO
+ * succeeds the completion callbacks will be called; on error, the error
+ * callbacks.
+ *
+ * This may add the IO to the current batch, or execute the request
+ * synchronously.
+ */
+void
+pgaio_io_stage(PgAioHandle *ioh)
+{
+ bool needs_synchronous;
+
+ /* allow a new IO to be staged */
+ my_aio->handed_out_io = NULL;
+
+ pgaio_io_update_state(ioh, AHS_PREPARED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ elog(DEBUG3, "io:%d: staged %s, executed synchronously: %d",
+ pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
+ Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
/*
* Release IO handle during resource owner cleanup.
*/
@@ -279,7 +316,7 @@ pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
pgaio_io_reclaim(ioh);
break;
- case AHS_DEFINED:
+ case AHS_PREPARING:
case AHS_PREPARED:
/* XXX: Should we warn about this when is_commit? */
pgaio_submit_staged();
@@ -383,7 +420,7 @@ void
pgaio_io_get_ref(PgAioHandle *ioh, PgAioHandleRef *ior)
{
Assert(ioh->state == AHS_HANDED_OUT ||
- ioh->state == AHS_DEFINED ||
+ ioh->state == AHS_PREPARING ||
ioh->state == AHS_PREPARED);
Assert(ioh->generation != 0);
@@ -437,7 +474,7 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
if (am_owner)
{
- if (state == AHS_DEFINED || state == AHS_PREPARED)
+ if (state == AHS_PREPARING || state == AHS_PREPARED)
{
/* XXX: Arguably this should be prevented by callers? */
pgaio_submit_staged();
@@ -489,8 +526,8 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
/* fallthrough */
/* waiting for owner to submit */
+ case AHS_PREPARING:
case AHS_PREPARED:
- case AHS_DEFINED:
/* waiting for reaper to complete */
/* fallthrough */
case AHS_REAPED:
@@ -501,8 +538,7 @@ pgaio_io_ref_wait(PgAioHandleRef *ior)
while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
{
- if (state != AHS_REAPED && state != AHS_DEFINED &&
- state != AHS_IN_FLIGHT)
+ if (state != AHS_REAPED && state != AHS_IN_FLIGHT)
break;
ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
}
@@ -570,8 +606,8 @@ pgaio_io_get_state_name(PgAioHandle *ioh)
return "idle";
case AHS_HANDED_OUT:
return "handed_out";
- case AHS_DEFINED:
- return "DEFINED";
+ case AHS_PREPARING:
+ return "PREPARING";
case AHS_PREPARED:
return "PREPARED";
case AHS_IN_FLIGHT:
@@ -588,43 +624,18 @@ pgaio_io_get_state_name(PgAioHandle *ioh)
/*
* Internal, should only be called from pgaio_io_prep_*().
+ *
+ * Switches the IO to PREPARING state.
*/
void
-pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+pgaio_io_start_staging(PgAioHandle *ioh)
{
- bool needs_synchronous;
-
Assert(ioh->state == AHS_HANDED_OUT);
Assert(pgaio_io_has_subject(ioh));
- ioh->op = op;
ioh->result = 0;
- pgaio_io_update_state(ioh, AHS_DEFINED);
-
- /* allow a new IO to be staged */
- my_aio->handed_out_io = NULL;
-
- pgaio_io_prepare_subject(ioh);
-
- pgaio_io_update_state(ioh, AHS_PREPARED);
-
- needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
-
- elog(DEBUG3, "io:%d: prepared %s, executed synchronously: %d",
- pgaio_io_get_id(ioh), pgaio_io_get_op_name(ioh),
- needs_synchronous);
-
- if (!needs_synchronous)
- {
- my_aio->staged_ios[my_aio->num_staged_ios++] = ioh;
- Assert(my_aio->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
- }
- else
- {
- pgaio_io_prepare_submit(ioh);
- pgaio_io_perform_synchronously(ioh);
- }
+ pgaio_io_update_state(ioh, AHS_PREPARING);
}
/*
@@ -858,8 +869,8 @@ pgaio_io_wait_for_free(void)
{
/* should not be in in-flight list */
case AHS_IDLE:
- case AHS_DEFINED:
case AHS_HANDED_OUT:
+ case AHS_PREPARING:
case AHS_PREPARED:
case AHS_COMPLETED_LOCAL:
elog(ERROR, "shouldn't get here with io:%d in state %d",
@@ -1004,7 +1015,7 @@ pgaio_bounce_buffer_wait_for_free(void)
case AHS_IDLE:
case AHS_HANDED_OUT:
continue;
- case AHS_DEFINED: /* should have been submitted above */
+ case AHS_PREPARING: /* should have been submitted above */
case AHS_PREPARED:
elog(ERROR, "shouldn't get here with io:%d in state %d",
pgaio_io_get_id(ioh), ioh->state);
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
index 3c255775833..e84b79d3f2e 100644
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -46,11 +46,12 @@ pgaio_io_prep_readv(PgAioHandle *ioh,
{
pgaio_io_before_prep(ioh);
+ ioh->op = PGAIO_OP_READV;
ioh->op_data.read.fd = fd;
ioh->op_data.read.offset = offset;
ioh->op_data.read.iov_length = iovcnt;
- pgaio_io_prepare(ioh, PGAIO_OP_READV);
+ pgaio_io_start_staging(ioh);
}
void
@@ -59,11 +60,12 @@ pgaio_io_prep_writev(PgAioHandle *ioh,
{
pgaio_io_before_prep(ioh);
+ ioh->op = PGAIO_OP_WRITEV;
ioh->op_data.write.fd = fd;
ioh->op_data.write.offset = offset;
ioh->op_data.write.iov_length = iovcnt;
- pgaio_io_prepare(ioh, PGAIO_OP_WRITEV);
+ pgaio_io_start_staging(ioh);
}
diff --git a/src/backend/storage/aio/aio_subject.c b/src/backend/storage/aio/aio_subject.c
index b2bd0c235e7..321e1d8e975 100644
--- a/src/backend/storage/aio/aio_subject.c
+++ b/src/backend/storage/aio/aio_subject.c
@@ -119,33 +119,6 @@ pgaio_io_get_subject_name(PgAioHandle *ioh)
return aio_subject_info[ioh->subject]->name;
}
-/*
- * Internal function which invokes ->prepare for all the registered callbacks.
- */
-void
-pgaio_io_prepare_subject(PgAioHandle *ioh)
-{
- Assert(ioh->subject > ASI_INVALID && ioh->subject < ASI_COUNT);
- Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
-
- for (int i = ioh->num_shared_callbacks; i > 0; i--)
- {
- PgAioHandleSharedCallbackID cbid = ioh->shared_callbacks[i - 1];
- const PgAioHandleSharedCallbacksEntry *ce = &aio_shared_cbs[cbid];
-
- if (!ce->cb->prepare)
- continue;
-
- elog(DEBUG3, "io:%d, op %s, subject %s, calling cb #%d %d/%s->prepare",
- pgaio_io_get_id(ioh),
- pgaio_io_get_op_name(ioh),
- pgaio_io_get_subject_name(ioh),
- i,
- cbid, ce->name);
- ce->cb->prepare(ioh);
- }
-}
-
/*
* Internal function which invokes ->complete for all the registered
* callbacks.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9bc0176a2ca..dd30856aca0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -179,6 +179,9 @@ int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
+static void local_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
+static void shared_buffer_writev_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
+
/*
* Backend-Private refcount management:
*
@@ -1725,7 +1728,6 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
pgaio_io_set_io_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
-
if (persistence == RELPERSISTENCE_TEMP)
pgaio_io_add_shared_cb(ioh, ASC_LOCAL_BUFFER_READ);
else
@@ -1736,6 +1738,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
did_start_io_overall = did_start_io_this = true;
smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
io_pages, io_buffers_len);
+ if (persistence == RELPERSISTENCE_TEMP)
+ local_buffer_readv_prepare(ioh, io_buffers, io_buffers_len);
+ else
+ shared_buffer_readv_prepare(ioh, io_buffers, io_buffers_len);
+ pgaio_io_stage(ioh);
ioh = NULL;
operation->nios++;
@@ -4170,10 +4177,11 @@ WriteBuffers(BuffersToWrite *to_write,
to_write->data_ptrs,
to_write->nbuffers,
false);
+ shared_buffer_writev_prepare(to_write->ioh, to_write->buffers, to_write->nbuffers);
+ pgaio_io_stage(to_write->ioh);
pgstat_count_io_op_n(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
IOOP_WRITE, to_write->nbuffers);
-
for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
{
Buffer cur_buf = to_write->buffers[nbuf];
@@ -6952,20 +6960,16 @@ ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
* and writes.
*/
static void
-shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
+shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write, Buffer *buffers, int nbuffers)
{
- uint64 *io_data;
- uint8 io_data_len;
PgAioHandleRef io_ref;
BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
- io_data = pgaio_io_get_io_data(ioh, &io_data_len);
-
pgaio_io_get_ref(ioh, &io_ref);
- for (int i = 0; i < io_data_len; i++)
+ for (int i = 0; i < nbuffers; i++)
{
- Buffer buf = (Buffer) io_data[i];
+ Buffer buf = buffers[i];
BufferDesc *bufHdr;
uint32 buf_state;
@@ -7022,16 +7026,16 @@ shared_buffer_prepare_common(PgAioHandle *ioh, bool is_write)
}
}
-static void
-shared_buffer_readv_prepare(PgAioHandle *ioh)
+void
+shared_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- shared_buffer_prepare_common(ioh, false);
+ shared_buffer_prepare_common(ioh, false, buffers, nbuffers);
}
static void
-shared_buffer_writev_prepare(PgAioHandle *ioh)
+shared_buffer_writev_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- shared_buffer_prepare_common(ioh, true);
+ shared_buffer_prepare_common(ioh, true, buffers, nbuffers);
}
static PgAioResult
@@ -7135,19 +7139,15 @@ shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
* and writes.
*/
static void
-local_buffer_readv_prepare(PgAioHandle *ioh)
+local_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers)
{
- uint64 *io_data;
- uint8 io_data_len;
PgAioHandleRef io_ref;
- io_data = pgaio_io_get_io_data(ioh, &io_data_len);
-
pgaio_io_get_ref(ioh, &io_ref);
- for (int i = 0; i < io_data_len; i++)
+ for (int i = 0; i < nbuffers; i++)
{
- Buffer buf = (Buffer) io_data[i];
+ Buffer buf = buffers[i];
BufferDesc *bufHdr;
uint32 buf_state;
@@ -7199,27 +7199,17 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
-static void
-local_buffer_writev_prepare(PgAioHandle *ioh)
-{
- elog(ERROR, "not yet");
-}
-
-
const struct PgAioHandleSharedCallbacks aio_shared_buffer_readv_cb = {
- .prepare = shared_buffer_readv_prepare,
.complete = shared_buffer_readv_complete,
.error = buffer_readv_error,
};
const struct PgAioHandleSharedCallbacks aio_shared_buffer_writev_cb = {
- .prepare = shared_buffer_writev_prepare,
.complete = shared_buffer_writev_complete,
};
const struct PgAioHandleSharedCallbacks aio_local_buffer_readv_cb = {
- .prepare = local_buffer_readv_prepare,
.complete = local_buffer_readv_complete,
.error = buffer_readv_error,
};
const struct PgAioHandleSharedCallbacks aio_local_buffer_writev_cb = {
- .prepare = local_buffer_writev_prepare,
+
};
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d12225a9949..bf4522eeac6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -985,9 +985,9 @@ mdstartreadv(PgAioHandle *ioh,
forknum,
blocknum,
nblocks);
- pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
-
FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+
+ pgaio_io_add_shared_cb(ioh, ASC_MD_READV);
}
/*
@@ -1136,9 +1136,8 @@ mdstartwritev(PgAioHandle *ioh,
forknum,
blocknum,
nblocks);
- pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
-
FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+ pgaio_io_add_shared_cb(ioh, ASC_MD_WRITEV);
}
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index caa52d2aaba..d126a10f9d4 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -212,12 +212,10 @@ typedef struct PgAioSubjectInfo
typedef PgAioResult (*PgAioHandleSharedCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
-typedef void (*PgAioHandleSharedCallbackPrepare) (PgAioHandle *ioh);
typedef void (*PgAioHandleSharedCallbackError) (PgAioResult result, const PgAioSubjectData *subject_data, int elevel);
typedef struct PgAioHandleSharedCallbacks
{
- PgAioHandleSharedCallbackPrepare prepare;
PgAioHandleSharedCallbackComplete complete;
PgAioHandleSharedCallbackError error;
} PgAioHandleSharedCallbacks;
@@ -247,6 +245,8 @@ struct ResourceOwnerData;
extern PgAioHandle *pgaio_io_get(struct ResourceOwnerData *resowner, PgAioReturn *ret);
extern PgAioHandle *pgaio_io_get_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern void pgaio_io_stage(PgAioHandle *ioh);
+
extern void pgaio_io_release(PgAioHandle *ioh);
extern void pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error);
@@ -261,7 +261,7 @@ extern void pgaio_io_set_io_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
extern void pgaio_io_set_io_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
extern uint64 *pgaio_io_get_io_data(PgAioHandle *ioh, uint8 *len);
-extern void pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op);
+extern void pgaio_io_start_staging(PgAioHandle *ioh);
extern int pgaio_io_get_id(PgAioHandle *ioh);
struct iovec;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index f4c57438dd4..55677d7dc8c 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -37,10 +37,10 @@ typedef enum PgAioHandleState
/* returned by pgaio_io_get() */
AHS_HANDED_OUT,
- /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
- AHS_DEFINED,
+ /* pgaio_io_start_staging() has been called, but IO hasn't been fully staged yet */
+ AHS_PREPARING,
- /* subjects prepare() callback has been called */
+ /* pgaio_io_stage() has been called, but the IO hasn't been submitted yet */
AHS_PREPARED,
/* IO is being executed */
@@ -249,7 +249,6 @@ typedef struct IoMethodOps
extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
-extern void pgaio_io_prepare_subject(PgAioHandle *ioh);
extern void pgaio_io_process_completion_subject(PgAioHandle *ioh);
extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 3523d8a3860..5c7d602d91b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -425,6 +425,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
/* solely to make it easier to write tests */
extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+extern void shared_buffer_readv_prepare(PgAioHandle *ioh, Buffer *buffers, int nbuffers);
/* freelist.c */
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index e495c5309b3..446da4f0231 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -264,6 +264,8 @@ read_corrupt_rel_block(PG_FUNCTION_ARGS)
smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
(void *) &page, 1);
+ shared_buffer_readv_prepare(ioh, &buf, 1);
+ pgaio_io_stage(ioh);
ReleaseBuffer(buf);
pgaio_io_ref_wait(&ior);
On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-12-19 17:29:12 -0500, Andres Freund wrote:
Not about patch itself, but questions about related stack functionality:
7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being issued, to review whether the code
is making sensible calls? There's no strace for io_uring - do you stick to
DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.

After I got this question from Thomas as well, I started hacking one up.
What information would you like to see?
Here's what I currently have:
..
├─[ RECORD 2 ]───┼────────────────────────────────────────────────┤
│ pid │ 358212 │
│ io_id │ 2051 │
│ io_generation │ 4199 │
│ state │ IN_FLIGHT │
│ operation │ read │
│ offset │ 511967232 │
│ length │ 262144 │
│ subject │ smgr │
│ iovec_data_len │ 32 │
│ raw_result │ (null) │
│ result │ UNKNOWN │
│ error_desc │ (null) │
│ subject_desc │ blocks 1373216..1373247 in file "base/5/16388" │
│ flag_sync │ f │
│ flag_localmem │ f │
│ flag_buffered │ t │
Cool! It's more than enough for me in future, thanks!
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.
If you are looking for other proposals:
* pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
* pg_debug_aios ?
* pg_debug_io ?
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs
If I could dream of one thing, it would be the 99.9th percentile of IO
response times in milliseconds for different classes of I/O traffic
(read/write/flush). But it sounds like it would be very similar to
pg_stat_io and potentially would have to be
per-tablespace/IO-traffic(subject)-type too. AFAIU pg_stat_io has an
improper structure to have that there.
BTW: before trying to even start to compile that AIO v2.2* and
responding to the previous review, what are you most interested in
hearing about, so that a review adds some value? Any workload-specific
measurements? Just general feedback, or functionality gaps?
Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay to
try the error handling routines? Some kind of AIO <-> standby/recovery
interactions?
* - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! So
let's officially recognize 2025 as the year of AIO in PG, as it
was the 1st message :D
-J.
Hi,
On 2025-01-08 15:04:39 +0100, Jakub Wartak wrote:
On Mon, Jan 6, 2025 at 5:28 PM Andres Freund <andres@anarazel.de> wrote:
I didn't think that pg_stat_* was quite the right namespace, given that it
shows not stats, but the currently ongoing IOs. I am going with pg_aios for
now, but I don't particularly like that.

If you are looking for other proposals:
* pg_aios_progress ? (to follow pattern of pg_stat_copy|vacuum_progress?)
* pg_debug_aios ?
* pg_debug_io ?
I think pg_aios is better than those, if not by much. Seems others are ok
with that name too. And we easily can evolve it later.
I think we'll want a pg_stat_aio as well, tracking things like:
- how often the queue to IO workers was full
- how many times we submitted IO to the kernel (<= #ios with io_uring)
- how many times we asked the kernel for events (<= #ios with io_uring)
- how many times we had to wait for in-flight IOs before issuing more IOs

If I could dream of one thing, it would be the 99.9th percentile of IO
response times in milliseconds for different classes of I/O traffic
(read/write/flush). But it sounds like it would be very similar to
pg_stat_io and potentially would have to be
per-tablespace/IO-traffic(subject)-type too.
Yea, that's a significant project on its own. It's not that cheap to compute
reasonably accurate percentiles and we have no infrastructure for doing so
right now.
AFAIU pg_stat_io has an improper structure to have that there.
Hm, not obvious to me why? It might make the view a bit wide to add it as an
additional column, but otherwise I don't see a problem?
BTW: before trying to even start to compile that AIO v2.2* and
responding to the previous review, what are you most interested in
hearing about, so that a review adds some value?
Due to the rather limited "users" of AIO in the patchset, I think most
benchmarks aren't expected to show any meaningful gains. However, they
shouldn't show any significant regressions either (when not using direct
IO). I think trying to find regressions would be a rather valuable thing.
I'm tempted to collect a few of the reasonably-ready read stream conversions
into the patchset, to make the potential gains more visible. But I am not sure
it's a good investment of time right now.
One small regression I do know about: scans of large relations that are
bigger than shared buffers but do fit in the kernel page cache. The increase
of BAS_BULKREAD does cause a small slowdown - but without it we can never do
sufficient asynchronous IO. I think the slowdown is small enough to just
accept that, but it's worth quantifying it on a few machines.
Any workload-specific measurements? Just general feedback, functionality
gaps?
To see the benefits it'd be interesting to compare:
1) sequential scan performance with data not in shared buffers, using buffered IO
2) same, but using direct IO when testing the patch
3) checkpoint performance
In my experiments 1) gains a decent amount of performance in many cases, but
nothing overwhelming - sequential scans are easy for the kernel to read ahead.
I do see very significant gains for 2) - On a system with 10 striped NVMe SSDs
that each can do ~3.5 GB/s I measured very parallel sequential scans (I had
to use ALTER TABLE to get sufficient numbers of workers):
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s
This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).
This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
I also see significant gains with 3). Bigger when using direct IO. One
complicating factor measuring 3) is that the first write to a block will often
be slower than subsequent writes because the filesystem will need to update
some journaled metadata, presenting a bottleneck.
Checkpoint performance is also severely limited by data checksum computation
if enabled - independent of this patchset.
One annoying thing when testing DIO is that right now VACUUM will be rather
slow if the data isn't already in s_b, as it isn't yet read-stream-ified.
Integrity/data testing with stuff like dm-dust, dm-flakey, dm-delay
to try the error handling routines?
Hm. I don't think that's going to work very well even on master. If the
filesystem fails there's not much that PG can do...
Some kind of AIO <-> standby/recovery interactions?
I wouldn't expect anything there. I think Thomas somewhere has a patch that
read-stream-ifies recovery prefetching; once that's done it would be more
interesting.
* - btw, Date: 2025-01-01 04:03:33 - I saw what you did there! so
let's officially recognize the 2025 as the year of AIO in PG, as it
was 1st message :D
Hah, that was actually the opposite of what I intended :). I'd hoped to post
earlier, but jetlag had caught up with me...
Greetings,
Andres Freund
Hi,
On 2025-01-07 22:09:56 +0200, Heikki Linnakangas wrote:
On 07/01/2025 18:11, Andres Freund wrote:
I didn't quite understand the point of the prepare callbacks. For example,
when AsyncReadBuffers() calls smgrstartreadv(), the
shared_buffer_readv_prepare() callback will be called. Why doesn't
AsyncReadBuffers() do the "prepare" work itself directly; why does it need
to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.

I assume it's somehow related to error handling, but I didn't quite get
it. Perhaps an "abort" callback that'd be called on error, instead of a
"prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO-related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).

Hmm. The comments say that when you call smgrstartreadv(), the IO handle may
no longer be modified, as the IO may be executed immediately. What if we
changed that so that it never submits the IO, only adds the necessary
callbacks to it?
In that world, when smgrstartreadv() returns, the necessary details and
completion callbacks have been set in the IO handle, but the caller can
still do more preparation before the IO is submitted. The caller must ensure
that it gets submitted, however, so no erroring out in that state.

Currently the call stack looks like this:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
-> shared_buffer_readv_prepare() (callback)
<- (return)
<- (return)
<- (return)
<- (return)
<- (return)

I'm thinking that the prepare work is done "on the way up" instead:
AsyncReadBuffers()
-> smgrstartreadv()
-> mdstartreadv()
-> FileStartReadV()
-> pgaio_io_prep_readv()
<- (return)
<- (return)
<- (return)
-> shared_buffer_readv_prepare()
<- (return)

Attached is a patch to demonstrate concretely what I mean.
I think this would be somewhat limiting. Right now it's indeed just bufmgr.c
that needs to do a preparation (or "moving of ownership") step - but I don't
think it's necessarily going to stay that way.
Consider e.g. a hypothetical threaded future in which we have refcounted file
descriptors. While AIO is ongoing, the AIO subsystem would need to ensure that
the FD refcount is increased, otherwise you'd obviously run into trouble if
the issuing backend errored out and released its own reference as part of
resowner release.
I don't think the approach you suggest above would scale well for such a
situation - shared_buffer_readv_prepare() would again need to call to
smgr->md->fd. Whereas with the current approach md.c (or fd.c?) could just
define its own prepare callback that increased the refcount at the right
moment.
There's a few other scenarios I can think of:
- If somebody were - no idea what made me think of that - to write an smgr
implementation where storage is accessed over the network, one might need to
keep network buffers and sockets alive for the duration of the IO.
- It'd be rather useful to have support for asynchronously extending a
relation, as that often requires filesystem journal IO and thus is slow. If
you're bulk loading, or the extension lock is contended, it'd be great if we
could start the next relation extension *before* it's needed, so the
extension doesn't have to happen synchronously. To avoid deadlocks, such an
asynchronous extension would need to be able to release the lock in any
other backend, just like it's needed for the content locks when
asynchronously writing. Which in turn would require transferring ownership
of the relevant buffers *and* the extension lock. You could mash this
together, but it seems like a separate callback would make it more
composable.
Does that make any sense to you?
This adds a new pgaio_io_stage() step to the issuer, and the issuer needs to
call the prepare functions explicitly, instead of having them as callbacks.
Nominally that's more steps, but IMHO it's better to be explicit. The same
actions were happening previously too, it was just hidden in the callback. I
updated the README to show that too.

I'm not wedded to this, but it feels a little better to me.
Right now the current approach seems to make more sense to me, but I'll think
about it more. I might also have missed something with my theorizing above.
Greetings,
Andres Freund
Hi,
On 2025-01-07 14:59:58 -0500, Robert Haas wrote:
On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

To me, those names don't convey that.
I'm certainly not wedded to these names - I went back and forth between
different names a fair bit, because I wasn't quite happy. I am however certain
that the current names are better than what it used to be (PgAioInProgress and
because that's long, a bunch of PgAioIP* names) :)
To make sure we're talking about the same things, I am thinking of the
following "entities" needing names:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
Because shared memory is limited, we need to reuse this entity. This reuse
needs to be possible "immediately" after completion, to avoid a bunch of
nasty scenarios.
To distinguish a reused PgAioHandle from its "prior" incarnation, each
PgAioHandle has a 64bit "generation" counter.
In addition to being referenceable via pointer, it's also possible to
assign a 32bit integer to each PgAioHandle, as there is a fixed number of
them.
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
As long as the issuer hasn't yet staged the IO, it can't be
reused. Therefore it's OK to just point to the PgAioHandle.
One disadvantage of just using a pointer to PgAioHandle* is that it's
harder to distinguish subsystem-internal functions that accept PgAioHandle*
from "public" functions that accept the "issuer reference".
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
This references 1) using a 32 bit ID and the 64bit generation.
This is used to allow any backend to wait for a specific IO to
complete. E.g. by including it in the BufferDesc so that WaitIO can wait
for it.
Because it includes the generation it's trivial to detect whether the
PgAioHandle was reused.
I would perhaps call the thing that supports issuer-only operations a
"PgAio" and the thing other people can use a "PgAioHandle". Or
"PgAioRequest" and "PgAioHandle" or something like that. With
PgAioHandleRef, IMHO you've got two words that both imply a layer of
indirection -- "handle" and "ref" -- which doesn't seem quite as nice,
because then the other thing -- "PgAioHandle" still sort of implies one
layer of indirection and the whole thing seems a bit less clear.
It's indirections all the way down. The PG representation of "one IO" in the
end is just an indirection for a kernel operation :)
I would like to split 1) and 2) above.
1) PgAio{Handle,Request,} (a large struct) - used internally by AIO subsystem,
"pointed to" by the following
2) PgAioIssuerRef (an ID or pointer) - used by the issuer to incrementally
define the IO
3) PgAioWaitRef - (an ID and generation) - used to wait for a specific IO to
complete, not affected by reuse of PgAioHandle
REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?

The Linux kernel calls those zombie processes, so we could call it a ZOMBIE
state, but that seems like it might be a bit of inside baseball.
ZOMBIE feels even later than REAPED to me :)
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.
How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed
?
Greetings,
Andres Freund
On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote:
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).

This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput. I remember checksum overhead being
negligible even when pulling in pages from page cache. Is it just that
the calculation is slow, or is it the fact that checksumming needs to
bring the page into the CPU cache? Did you notice any hints as to which
might be the case? I don't really have a machine at hand that can do
anywhere close to this amount of I/O.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.
--
Ants Aasma
Hi,
On 2025-01-09 10:59:22 +0200, Ants Aasma wrote:
On Wed, 8 Jan 2025 at 22:58, Andres Freund <andres@anarazel.de> wrote:
master: ~18 GB/s
patch, buffered: ~20 GB/s
patch, direct, worker: ~28 GB/s
patch, direct, uring: ~35 GB/s

This was with io_workers=32, io_max_concurrency=128,
effective_io_concurrency=1000 (doesn't need to be that high, but it's what I
still have the numbers for).

This was without data checksums enabled as otherwise the checksum code becomes
a *huge* bottleneck.

I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.
It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.
I remember checksum overhead being negligible even when pulling in pages
from page cache.
It's indeed much less of an issue when pulling pages from the page cache, as
the copy from the page cache is fairly slow. With direct-IO, where the copy
from the page cache isn't the main driver of CPU use anymore, it becomes much
clearer.
Even with buffered IO it became a bigger issue with 17, due to
io_combine_limit. It turns out that lots of tiny syscalls are slow, so the
peak throughput that could reach the checksumming code was lower.
I created a 21554MB relation and measured the time to do a pg_prewarm() of
that relation after evicting all of shared buffers (the first time buffers are
touched has a bit different perf characteristics). I am using direct IO and
io_uring here, as buffered IO would include the page cache copy cost and
worker mode could parallelize the checksum computation on reads. The checksum
cost is a bigger issue for writes than reads, but it's harder to quickly
generate enough dirty data for a repeatable benchmark.
This system can do about 12.5GB/s of read IO.
Just to show the effect of the read size on page cache copy performance:
config checksums time in ms
buffered io_engine=sync io_combine_limit=1 0 6712.153
buffered io_engine=sync io_combine_limit=2 0 5919.215
buffered io_engine=sync io_combine_limit=4 0 5738.496
buffered io_engine=sync io_combine_limit=8 0 5396.415
buffered io_engine=sync io_combine_limit=16 0 5312.803
buffered io_engine=sync io_combine_limit=32 0 5275.389
To see the effect of page cache copy overhead:
config checksums time in ms
buffered io_engine=io_uring 0 3901.625
direct io_engine=io_uring 0 2075.330
Now to show the effect of checksums (enabled/disabled with pg_checksums):
config checksums time in ms
buffered io_engine=io_uring 0 3883.127
buffered io_engine=io_uring 1 5880.892
direct io_engine=io_uring 0 2067.142
direct io_engine=io_uring 1 3835.968
So with direct + uring w/o checksums, we can reach 10427 MB/s (close-ish to
disk speed), but with checksums we only reach 5620 MB/s.
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints as to which
might be the case?
I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
already
I don't really have a machine at hand that can do anywhere close to this
amount of I/O.
It's visible even when pulling from the page cache, if to a somewhat lesser
degree.
I wonder if it's worth adding a test function that computes checksums of all
shared buffers in memory already. That'd allow exercising the checksum code in
a realistic context (i.e. buffer locking etc preventing some out-of-order
effects, using 8kB chunks etc) without also needing to involve the IO path.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.
You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?
FWIW CPU profiles show all the time being spent in the "main checksum
calculation" loop:
Percent | Source code & Disassembly of postgres for cycles:P (5866 samples, percent: local period)
--------------------------------------------------------------------------------------------------------
:
:
:
: 3 Disassembly of section .text:
:
: 5 00000000009e3670 <pg_checksum_page>:
: 6 * calculation isn't affected by the old checksum stored on the page.
: 7 * Restore it after, because actually updating the checksum is NOT part of
: 8 * the API of this function.
: 9 */
: 10 save_checksum = cpage->phdr.pd_checksum;
: 11 cpage->phdr.pd_checksum = 0;
0.00 : 9e3670: xor %eax,%eax
: 13 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e3672: mov $0x1000193,%r8d
: 15 cpage->phdr.pd_checksum = 0;
0.00 : 9e3678: vmovdqa -0x693fa0(%rip),%ymm3 # 34f6e0 <.LC0>
0.05 : 9e3680: vmovdqa -0x6935c8(%rip),%ymm4 # 3500c0 <.LC1>
0.00 : 9e3688: vmovdqa -0x693c10(%rip),%ymm0 # 34fa80 <.LC2>
0.00 : 9e3690: vmovdqa -0x693598(%rip),%ymm1 # 350100 <.LC3>
: 20 {
0.00 : 9e3698: mov %esi,%ecx
0.02 : 9e369a: lea 0x2000(%rdi),%rdx
: 23 save_checksum = cpage->phdr.pd_checksum;
0.00 : 9e36a1: movzwl 0x8(%rdi),%esi
: 25 CHECKSUM_COMP(sums[j], page->data[i][j]);
0.00 : 9e36a5: vpbroadcastd %r8d,%ymm5
: 27 cpage->phdr.pd_checksum = 0;
0.00 : 9e36ab: mov %ax,0x8(%rdi)
: 29 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.14 : 9e36af: mov %rdi,%rax
0.00 : 9e36b2: nopw 0x0(%rax,%rax,1)
: 32 CHECKSUM_COMP(sums[j], page->data[i][j]);
15.36 : 9e36b8: vpxord (%rax),%ymm1,%ymm1
4.79 : 9e36be: vmovdqu 0x80(%rax),%ymm2
: 35 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.07 : 9e36c6: add $0x100,%rax
: 37 CHECKSUM_COMP(sums[j], page->data[i][j]);
2.45 : 9e36cc: vpxord -0xe0(%rax),%ymm0,%ymm0
2.85 : 9e36d3: vpmulld %ymm5,%ymm1,%ymm6
0.02 : 9e36d8: vpsrld $0x11,%ymm1,%ymm1
3.17 : 9e36dd: vpternlogd $0x96,%ymm6,%ymm1,%ymm2
2.01 : 9e36e4: vpmulld %ymm5,%ymm0,%ymm6
13.16 : 9e36e9: vpmulld %ymm5,%ymm2,%ymm1
0.03 : 9e36ee: vpsrld $0x11,%ymm2,%ymm2
0.02 : 9e36f3: vpsrld $0x11,%ymm0,%ymm0
2.57 : 9e36f8: vpxord %ymm2,%ymm1,%ymm1
0.89 : 9e36fe: vmovdqu -0x60(%rax),%ymm2
0.12 : 9e3703: vpternlogd $0x96,%ymm6,%ymm0,%ymm2
4.48 : 9e370a: vpmulld %ymm5,%ymm2,%ymm0
0.49 : 9e370f: vpsrld $0x11,%ymm2,%ymm2
0.99 : 9e3714: vpxord %ymm2,%ymm0,%ymm0
11.88 : 9e371a: vpxord -0xc0(%rax),%ymm4,%ymm2
2.80 : 9e3721: vpmulld %ymm5,%ymm2,%ymm6
0.68 : 9e3726: vpsrld $0x11,%ymm2,%ymm4
4.94 : 9e372b: vmovdqu -0x40(%rax),%ymm2
1.45 : 9e3730: vpternlogd $0x96,%ymm6,%ymm4,%ymm2
8.63 : 9e3737: vpmulld %ymm5,%ymm2,%ymm4
0.17 : 9e373c: vpsrld $0x11,%ymm2,%ymm2
1.81 : 9e3741: vpxord %ymm2,%ymm4,%ymm4
0.10 : 9e3747: vpxord -0xa0(%rax),%ymm3,%ymm2
0.70 : 9e374e: vpmulld %ymm5,%ymm2,%ymm6
1.65 : 9e3753: vpsrld $0x11,%ymm2,%ymm3
0.03 : 9e3758: vmovdqu -0x20(%rax),%ymm2
0.85 : 9e375d: vpternlogd $0x96,%ymm6,%ymm3,%ymm2
3.73 : 9e3764: vpmulld %ymm5,%ymm2,%ymm3
0.07 : 9e3769: vpsrld $0x11,%ymm2,%ymm2
1.48 : 9e376e: vpxord %ymm2,%ymm3,%ymm3
: 68 for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
0.02 : 9e3774: cmp %rax,%rdx
2.32 : 9e3777: jne 9e36b8 <pg_checksum_page+0x48>
: 71 CHECKSUM_COMP(sums[j], 0);
0.17 : 9e377d: vpmulld %ymm5,%ymm0,%ymm7
0.07 : 9e3782: vpmulld %ymm5,%ymm1,%ymm6
: 74 checksum = pg_checksum_block(cpage);
: 75 cpage->phdr.pd_checksum = save_checksum;
0.00 : 9e3787: mov %si,0x8(%rdi)
: 77 CHECKSUM_COMP(sums[j], 0);
0.02 : 9e378b: vpsrld $0x11,%ymm0,%ymm0
0.02 : 9e3790: vpsrld $0x11,%ymm1,%ymm1
0.02 : 9e3795: vpsrld $0x11,%ymm4,%ymm2
0.00 : 9e379a: vpxord %ymm0,%ymm7,%ymm7
0.10 : 9e37a0: vpmulld %ymm5,%ymm4,%ymm0
0.00 : 9e37a5: vpxord %ymm1,%ymm6,%ymm6
0.17 : 9e37ab: vpmulld %ymm5,%ymm3,%ymm1
0.19 : 9e37b0: vpmulld %ymm5,%ymm6,%ymm4
0.00 : 9e37b5: vpsrld $0x11,%ymm6,%ymm6
0.02 : 9e37ba: vpxord %ymm2,%ymm0,%ymm0
0.00 : 9e37c0: vpsrld $0x11,%ymm3,%ymm2
0.22 : 9e37c5: vpmulld %ymm5,%ymm7,%ymm3
0.02 : 9e37ca: vpsrld $0x11,%ymm7,%ymm7
0.00 : 9e37cf: vpxord %ymm2,%ymm1,%ymm1
0.03 : 9e37d5: vpsrld $0x11,%ymm0,%ymm2
0.15 : 9e37da: vpmulld %ymm5,%ymm0,%ymm0
: 94 result ^= sums[i];
0.00 : 9e37df: vpternlogd $0x96,%ymm3,%ymm7,%ymm2
: 96 CHECKSUM_COMP(sums[j], 0);
0.05 : 9e37e6: vpsrld $0x11,%ymm1,%ymm3
0.19 : 9e37eb: vpmulld %ymm5,%ymm1,%ymm1
: 99 result ^= sums[i];
0.02 : 9e37f0: vpternlogd $0x96,%ymm4,%ymm6,%ymm0
0.10 : 9e37f7: vpxord %ymm1,%ymm0,%ymm0
0.07 : 9e37fd: vpternlogd $0x96,%ymm2,%ymm3,%ymm0
0.15 : 9e3804: vextracti32x4 $0x1,%ymm0,%xmm1
0.03 : 9e380b: vpxord %xmm0,%xmm1,%xmm0
0.14 : 9e3811: vpsrldq $0x8,%xmm0,%xmm1
0.12 : 9e3816: vpxord %xmm1,%xmm0,%xmm0
0.09 : 9e381c: vpsrldq $0x4,%xmm0,%xmm1
0.12 : 9e3821: vpxord %xmm1,%xmm0,%xmm0
0.05 : 9e3827: vmovd %xmm0,%eax
:
: 111 /* Mix in the block number to detect transposed pages */
: 112 checksum ^= blkno;
0.07 : 9e382b: xor %ecx,%eax
:
: 115 /*
: 116 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
: 117 * one. That avoids checksums of zero, which seems like a good idea.
: 118 */
: 119 return (uint16) ((checksum % 65535) + 1);
0.00 : 9e382d: mov $0x80008001,%ecx
0.03 : 9e3832: mov %eax,%edx
0.27 : 9e3834: imul %rcx,%rdx
0.09 : 9e3838: shr $0x2f,%rdx
0.07 : 9e383c: lea 0x1(%rax,%rdx,1),%eax
0.00 : 9e3840: vzeroupper
: 126 }
0.15 : 9e3843: ret
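As an aside on the tail end above: the `% 65535` from source line 119 is compiled into a reciprocal multiplication (the `imul` by 0x80008001 plus `shr $0x2f`), so no actual division runs per page. A standalone restatement of that reduction, for reference:

```c
#include <stdint.h>

/*
 * The final reduction from pg_checksum_page() (source line 119 above): fold
 * the 32-bit checksum into the range 1..65535 so a stored checksum of zero
 * cannot occur. gcc strength-reduces the % 65535 into the multiply by
 * 0x80008001 and shift seen in the disassembly.
 */
static uint16_t
reduce_checksum(uint32_t checksum)
{
	return (uint16_t) ((checksum % 65535) + 1);
}
```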
I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.
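For context, the annotated loop is PostgreSQL's FNV-1a-derived block checksum. A minimal scalar sketch of the algorithm from checksum_impl.h - with the per-lane seed table omitted, so the zero seeds here are illustrative rather than the real constants:

```c
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ    8192
#define N_SUMS    32
#define FNV_PRIME 16777619u

/*
 * One mixing round: xor in the input word, multiply by the FNV prime, and
 * fold in a 17-bit right shift - the vpmulld/vpsrld/vpxord triple above.
 */
#define CHECKSUM_COMP(checksum, value) do { \
	uint32_t __tmp = (checksum) ^ (value); \
	(checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

static uint32_t
checksum_block_sketch(const void *data)
{
	const uint32_t *page = data;
	uint32_t	sums[N_SUMS] = {0};	/* real code seeds these from a table */
	uint32_t	result = 0;
	uint32_t	i,
				j;

	/* main loop: N_SUMS independent lanes, which is what lets gcc vectorize */
	for (i = 0; i < (uint32_t) (BLCKSZ / (sizeof(uint32_t) * N_SUMS)); i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], page[i * N_SUMS + j]);

	/* two extra rounds of zeroes so the last input words get mixed in */
	for (i = 0; i < 2; i++)
		for (j = 0; j < N_SUMS; j++)
			CHECKSUM_COMP(sums[j], 0);

	for (i = 0; i < N_SUMS; i++)
		result ^= sums[i];

	return result;
}
```

The dependency chain within each lane is multiply then shift/xor, so per-lane latency is dominated by vpmulld - which is why the Intel vs. AMD vpmulld latency difference discussed in this thread matters.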
Greetings,
Andres Freund
On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote:
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.

It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.
Interesting, I wonder if it's related to Intel increasing vpmulld
latency to 10 already back in Haswell. The Zen 3 I'm testing on has
latency 3 and has twice the throughput.
Attached is a naive and crude benchmark that I used for testing here.
Compiled with:
gcc -O2 -funroll-loops -ftree-vectorize -march=native \
-I$(pg_config --includedir-server) \
bench-checksums.c -o bench-checksums-native
Just fills up an array of pages and checksums them, first argument is
number of checksums, second is array size. I used 1M checksums and 100
pages for in cache behavior and 100000 pages for in memory
performance.
869.85927ms @ 9.418 GB/s - generic from memory
772.12252ms @ 10.610 GB/s - generic in cache
442.61869ms @ 18.508 GB/s - native from memory
137.07573ms @ 59.763 GB/s - native in cache
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints which
might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
   doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
   already

I didn't yet check the code, but when doing aio completions, will checksumming
be running on the same core as is going to be using the page?
It could also be that for some reason the checksumming is creating
extra bandwidth on memory bus or CPU internal rings, which due to the
already high amount of data already flying around causes contention.
I don't really have a machine at hand that can do anywhere close to this
amount of I/O.

It's visible even when pulling from the page cache, if to a somewhat lesser
degree.
Good point, I'll see if I can reproduce.
I wonder if it's worth adding a test function that computes checksums of all
shared buffers in memory already. That'd allow exercising the checksum code in
a realistic context (i.e. buffer locking etc preventing some out-of-order
effects, using 8kB chunks etc) without also needing to involve the IO path.
OoO shouldn't matter that much, over here even in the best case it's
still taking 500+ cycles per iteration.
I'm asking because if it's the calculation that is slow then it seems
like it's time to compile different ISA extension variants of the
checksum code and select the best one at runtime.

You think it's ISA specific? I don't see a significant effect of compiling
with -march=native or not - and that should suffice to make the checksum code
built with sufficiently high ISA support, right?
Right, the disassembly below looked very good.
FWIW CPU profiles show all the time being spent in the "main checksum
calculation" loop:
.. disassembly omitted for brevity
Not sure if it's applicable here or not due to microarch differences.
But in my case when bounded by memory bandwidth the main loop events
were clustered around a few instructions like it was in here, whereas
when running from cache all instructions were about equally
represented.
I did briefly experiment with changing N_SUMS. 16 is substantially worse, 64
seems to be about the same as 32.
This suggests that mulld latency is not the culprit.
Regards,
Ants
Hi,
On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote:
I'm curious about this because the checksum code should be fast enough
to easily handle that throughput.

It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
workstation. But we don't have a good ready-made way of testing that without
also doing IO, so it's kinda hard to say.

Interesting, I wonder if it's related to Intel increasing vpmulld
latency to 10 already back in Haswell. The Zen 3 I'm testing on has
latency 3 and has twice the throughput.
Attached is a naive and crude benchmark that I used for testing here.
Compiled with:

gcc -O2 -funroll-loops -ftree-vectorize -march=native \
    -I$(pg_config --includedir-server) \
    bench-checksums.c -o bench-checksums-native

Just fills up an array of pages and checksums them, first argument is
number of checksums, second is array size. I used 1M checksums and 100
pages for in cache behavior and 100000 pages for in memory
performance.

869.85927ms @ 9.418 GB/s - generic from memory
772.12252ms @ 10.610 GB/s - generic in cache
442.61869ms @ 18.508 GB/s - native from memory
137.07573ms @ 59.763 GB/s - native in cache
printf '%16s\t%16s\t%s\n' march mem result
for mem in 100 100000 1000000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    printf "%16s\t%16s\t" $march $mem
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
      -I ~/src/postgresql/src/include/ -I src/include/ \
      /tmp/bench-checksums.c -o bench-checksums-native &&
      numactl --physcpubind 1 --membind 0 ./bench-checksums-native 1000000 $mem
  done
done
Workstation w/ 2x Xeon Gold 6442Y:
march mem result
x86-64 100 731.87779ms @ 11.193 GB/s
x86-64-v2 100 327.18580ms @ 25.038 GB/s
x86-64-v3 100 264.03547ms @ 31.026 GB/s
x86-64-v4 100 282.08065ms @ 29.041 GB/s
native 100 246.13766ms @ 33.282 GB/s
x86-64 100000 842.66827ms @ 9.722 GB/s
x86-64-v2 100000 604.52959ms @ 13.551 GB/s
x86-64-v3 100000 477.16239ms @ 17.168 GB/s
x86-64-v4 100000 476.07039ms @ 17.208 GB/s
native 100000 456.08080ms @ 17.962 GB/s
x86-64 1000000 845.51132ms @ 9.689 GB/s
x86-64-v2 1000000 612.07973ms @ 13.384 GB/s
x86-64-v3 1000000 485.23738ms @ 16.882 GB/s
x86-64-v4 1000000 483.86411ms @ 16.930 GB/s
native 1000000 462.88461ms @ 17.698 GB/s
Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march mem result
x86-64 100 417.19762ms @ 19.636 GB/s
x86-64-v2 100 130.67596ms @ 62.689 GB/s
x86-64-v3 100 97.07758ms @ 84.386 GB/s
x86-64-v4 100 95.67704ms @ 85.621 GB/s
native 100 95.15734ms @ 86.089 GB/s
x86-64 100000 431.38370ms @ 18.990 GB/s
x86-64-v2 100000 215.74856ms @ 37.970 GB/s
x86-64-v3 100000 199.74492ms @ 41.012 GB/s
x86-64-v4 100000 186.98300ms @ 43.811 GB/s
native 100000 187.68125ms @ 43.648 GB/s
x86-64 1000000 433.87893ms @ 18.881 GB/s
x86-64-v2 1000000 217.46561ms @ 37.670 GB/s
x86-64-v3 1000000 200.40667ms @ 40.877 GB/s
x86-64-v4 1000000 187.51978ms @ 43.686 GB/s
native 1000000 190.29273ms @ 43.049 GB/s
Workstation w/ 2x Xeon Gold 5215:
march mem result
x86-64 100 780.38881ms @ 10.497 GB/s
x86-64-v2 100 389.62005ms @ 21.026 GB/s
x86-64-v3 100 323.97294ms @ 25.286 GB/s
x86-64-v4 100 274.19493ms @ 29.877 GB/s
native 100 283.48674ms @ 28.897 GB/s
x86-64 100000 1112.63898ms @ 7.363 GB/s
x86-64-v2 100000 831.45641ms @ 9.853 GB/s
x86-64-v3 100000 696.20789ms @ 11.767 GB/s
x86-64-v4 100000 685.61636ms @ 11.948 GB/s
native 100000 689.78023ms @ 11.876 GB/s
x86-64 1000000 1128.65580ms @ 7.258 GB/s
x86-64-v2 1000000 843.92594ms @ 9.707 GB/s
x86-64-v3 1000000 718.78848ms @ 11.397 GB/s
x86-64-v4 1000000 687.68258ms @ 11.912 GB/s
native 1000000 705.34731ms @ 11.614 GB/s
That's quite the drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.
The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some cpu-capability based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
I just realized that
a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
matter in my numbers though because I was building with -O3 and
march=native.
This clearly ought to be fixed.
b) Neither build uses the optimized flags for pg_checksum and pg_upgrade, both
of which include checksum_imp.h directly.
This probably should be fixed too - perhaps by building the relevant code
once as part of fe_utils or such?
It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize -ftree-slp-vectorize. But loop unrolling isn't
enabled.
I do see a perf difference at -O2 between using/not using
-funroll-loops. Interestingly not at -O3, despite -funroll-loops not actually
being enabled by -O3. I think the relevant option that *is* turned on by O3 is
-fpeel-loops.
Here's a comparison of different flags run the 6442Y
printf '%16s\t%32s\t%16s\t%s\n' march flags mem result
for mem in 100 100000; do
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"
      gcc $flags -march=$march -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native &&
        numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem
    done
  done
done
march flags mem result
x86-64 -O2 100 2280.86253ms @ 10.775 GB/s
x86-64 -O2 -funroll-loops 100 2195.66942ms @ 11.193 GB/s
x86-64 -O3 100 2422.57588ms @ 10.145 GB/s
x86-64 -O3 -funroll-loops 100 2243.75826ms @ 10.953 GB/s
x86-64-v2 -O2 100 1243.68063ms @ 19.761 GB/s
x86-64-v2 -O2 -funroll-loops 100 979.67783ms @ 25.086 GB/s
x86-64-v2 -O3 100 988.80296ms @ 24.854 GB/s
x86-64-v2 -O3 -funroll-loops 100 991.31632ms @ 24.791 GB/s
x86-64-v3 -O2 100 1146.90165ms @ 21.428 GB/s
x86-64-v3 -O2 -funroll-loops 100 785.81395ms @ 31.275 GB/s
x86-64-v3 -O3 100 800.53627ms @ 30.699 GB/s
x86-64-v3 -O3 -funroll-loops 100 790.21230ms @ 31.101 GB/s
x86-64-v4 -O2 100 883.82916ms @ 27.806 GB/s
x86-64-v4 -O2 -funroll-loops 100 831.55372ms @ 29.554 GB/s
x86-64-v4 -O3 100 843.23141ms @ 29.145 GB/s
x86-64-v4 -O3 -funroll-loops 100 821.19969ms @ 29.927 GB/s
native -O2 100 1197.41357ms @ 20.524 GB/s
native -O2 -funroll-loops 100 718.05253ms @ 34.226 GB/s
native -O3 100 747.94090ms @ 32.858 GB/s
native -O3 -funroll-loops 100 751.52379ms @ 32.702 GB/s
x86-64 -O2 100000 2911.47087ms @ 8.441 GB/s
x86-64 -O2 -funroll-loops 100000 2525.45504ms @ 9.731 GB/s
x86-64 -O3 100000 2497.42016ms @ 9.841 GB/s
x86-64 -O3 -funroll-loops 100000 2346.33551ms @ 10.474 GB/s
x86-64-v2 -O2 100000 2124.10102ms @ 11.570 GB/s
x86-64-v2 -O2 -funroll-loops 100000 1819.09659ms @ 13.510 GB/s
x86-64-v2 -O3 100000 1613.45823ms @ 15.232 GB/s
x86-64-v2 -O3 -funroll-loops 100000 1607.09245ms @ 15.292 GB/s
x86-64-v3 -O2 100000 1972.89390ms @ 12.457 GB/s
x86-64-v3 -O2 -funroll-loops 100000 1432.58229ms @ 17.155 GB/s
x86-64-v3 -O3 100000 1533.18003ms @ 16.029 GB/s
x86-64-v3 -O3 -funroll-loops 100000 1539.39779ms @ 15.965 GB/s
x86-64-v4 -O2 100000 1591.96881ms @ 15.437 GB/s
x86-64-v4 -O2 -funroll-loops 100000 1434.91828ms @ 17.127 GB/s
x86-64-v4 -O3 100000 1454.30133ms @ 16.899 GB/s
x86-64-v4 -O3 -funroll-loops 100000 1429.13733ms @ 17.196 GB/s
native -O2 100000 1980.53734ms @ 12.409 GB/s
native -O2 -funroll-loops 100000 1373.95337ms @ 17.887 GB/s
native -O3 100000 1517.90164ms @ 16.191 GB/s
native -O3 -funroll-loops 100000 1508.37021ms @ 16.293 GB/s
Is it just that the calculation is slow, or is it the fact that checksumming
needs to bring the page into the CPU cache? Did you notice any hints which
might be the case?

I don't think the issue is that checksumming pulls the data into CPU caches:
1) This is visible with SELECT that actually uses the data
2) I added prefetching to avoid any meaningful amount of cache misses and it
   doesn't change the overall timing much
3) It's visible with buffered IO, which has pulled the data into CPU caches
   already

I didn't yet check the code, but when doing aio completions, will checksumming
be running on the same core as is going to be using the page?
With io_uring normally yes, the exception being that another backend that
needs the same page could end up running the completion.
With worker mode normally no.
Greetings,
Andres Freund
On Thu, 9 Jan 2025 at 22:53, Andres Freund <andres@anarazel.de> wrote:
<Edited to highlight interesting numbers>
Workstation w/ 2x Xeon Gold 6442Y:
march mem result
native 100 246.13766ms @ 33.282 GB/s
native 100000 456.08080ms @ 17.962 GB/s

Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
march mem result
native 100 95.15734ms @ 86.089 GB/s
native 100000 187.68125ms @ 43.648 GB/s

Workstation w/ 2x Xeon Gold 5215:
march mem result
native 100 283.48674ms @ 28.897 GB/s
native 100000 689.78023ms @ 11.876 GB/s

That's quite the drastic difference between AMD and Intel. Of course it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.
In hindsight, building the hash around the mulld primitive was a bad decision,
because Intel for whatever reason decided to kill the performance of it:
              vpmulld latency   throughput (values/cycle)
Sandy Bridge        5                  4
Alder Lake         10                  8
Zen 4               3                 16
Zen 5               3                 32
Most top performing hashes these days seem to be built around AES
instructions.
But I was curious why there is such a difference in streaming results.
Turns out that for whatever reason one core gets access to much less
bandwidth on Intel than on AMD. [1]
This made me take another look at your previous prewarm numbers. It looks
like performance is suspiciously proportional to the number of copies of
data the CPU has to make:
config checksums time in ms number of copies
buffered io_engine=io_uring 0 3883.127 2
buffered io_engine=io_uring 1 5880.892 3
direct io_engine=io_uring 0 2067.142 1
direct io_engine=io_uring 1 3835.968 2
To me that feels like there is a bandwidth bottleneck in this workload, and
checksumming the page when the contents are not looked at just adds to the
consumed bandwidth, bringing down the performance correspondingly (dividing
each time by its copy count gives roughly 1.9-2.1 seconds per copy in every
row).
This doesn't explain why you observed slowdown in the select case, but I
think that might be due to the per-core bandwidth limitation. Both cases
might pull in the same amount of data into the cache, but without checksums
it is spread out over a longer time allowing other work to happen
concurrently.
[1]: https://chipsandcheese.com/p/a-peek-at-sapphire-rapids#%C2%A7bandwidth
The difference between the baseline CPU target and a more modern profile is
also rather impressive. Looks like some cpu-capability based dispatch would
likely be worth it, even if it didn't matter in my case due to -march=native.
Yes, along with using function attributes for the optimization flags to avoid
the build system hacks.
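Such dispatch could look roughly like this (an assumed shape, not code from the patchset): build the same loop twice using the GCC/Clang `target` function attribute, and resolve once at startup via `__builtin_cpu_supports()`:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn) (const uint32_t *words, size_t n);

static uint32_t
checksum_words_generic(const uint32_t *words, size_t n)
{
	uint32_t	sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 16777619u + words[i];
	return sum;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/*
 * Same body, but this copy may be compiled with AVX2, regardless of the
 * baseline -march the rest of the tree is built with.
 */
__attribute__((target("avx2")))
static uint32_t
checksum_words_avx2(const uint32_t *words, size_t n)
{
	uint32_t	sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 16777619u + words[i];
	return sum;
}
#endif

/* Resolve once, e.g. at startup, and cache the resulting function pointer. */
static checksum_fn
resolve_checksum(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		return checksum_words_avx2;
#endif
	return checksum_words_generic;
}
```

On x86 the resolver could also check for AVX-512 variants; GCC's `target_clones` attribute automates the same pattern via an ifunc resolver.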
--
Ants
On Wed, Jan 8, 2025 at 7:26 PM Andres Freund <andres@anarazel.de> wrote:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
With that additional information, I don't mind this naming too much,
but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
PgAioHandleRef -> PgAioHandle.
ZOMBIE feels even later than REAPED to me :)
Makes logical sense, because you would assume that you die first and
then later become an undead creature, but the UNIX precedent is that
dying turns you into a zombie and someone then has to reap the exit
status for you to be just plain dead. :-)
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed?
That's not bad. I like RAW better than KERNEL. I was hoping to use
different words like COMPLETE and DONE rather than, as you did it
here, COMPLETE and COMPLETE, but it's probably fine.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-01-13 15:43:46 -0500, Robert Haas wrote:
On Wed, Jan 8, 2025 at 7:26 PM Andres Freund <andres@anarazel.de> wrote:
1) Shared memory representation of an IO, for the AIO subsystem internally
Currently: PgAioHandle
2) A way for the issuer of an IO to reference 1), to attach information to the
IO
Currently: PgAioHandle*
3) A way for any backend to wait for a specific IO to complete
Currently: PgAioHandleRef
With that additional information, I don't mind this naming too much,
but I still think PgAioHandle -> PgAio and PgAioHandleRef ->
PgAioHandle is worth considering. Compare BackgroundWorkerSlot and
BackgroundWorkerHandle, which suggests PgAioHandle -> PgAioSlot and
PgAioHandleRef -> PgAioHandle.
I don't love PgAioHandle -> PgAio as there are other things (e.g. per-backend
state) in the PgAio namespace...
I do agree with Heikki that REAPED sounds later than COMPLETED, because you
reap zombie processes by collecting their exit status. Maybe you could have
AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
there's still completion-related work to be done, and then the other state
could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

How about
AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed?
That's not bad. I like RAW better than KERNEL.
Cool.
I was hoping to use different words like COMPLETE and DONE rather than, as
you did it here, COMPLETE and COMPLETE, but it's probably fine.
Once the IO is really done, the handle is immediately recycled (and moved into
IDLE state, ready to be used again).
Greetings,
Andres Freund
On Mon, Jan 13, 2025 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
Once the IO is really done, the handle is immediately recycled (and moved into
IDLE state, ready to be used again).
OK, fair enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Attached is v2.3.
There are a lot of changes - primarily renaming things based on on-list and
off-list feedback, but also some other things:
Functional:
- Added pg_aios view
- md.c registering sync requests, that was previously omitted
- This triggered stats issues during shutdown, as it can lead to IO workers
emitting stats in some corner cases. I've written a patch series to
address this [1]. For now I've included them in this patchset, but I would
like to push the reordering patches soon.
- Testing error handling for temp table IO made me realize that the previous
pattern of just tracking the refcount held by the IO subsystem in the
LocalRefCount array leads to spurious buffer leak warnings [2]. I attached
a prototype patch to deal with this by bringing localbuf.c more in line with
bufmgr.c, but it needs some cleanup.
That's in v2.3-0020-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch
- Wait for all IOs to finish during shutdown. This is primarily required to
ensure there aren't IOs initiated by a prior "owner" of a ProcNumber when a
new backend starts. But there are also some kernels that don't like when
exiting while IO is in flight.
- Re-armed local completion callbacks, they're required for correctness of
temporary table IO
- Added a bunch of central debug helpers that only lead to output if
PGAIO_VERBOSE is defined. That did make code a good bit more readable.
Polishing:
- Lots of copy editing, a lot of it thanks to feedback by Noah and Heikki
- Renamed the previous concept of a "subject" of an IO (i.e. what the IO is
executed on, an smgr relation, a WAL file, ...) to "target". I'm not in
love with that name, but I went through dozens of variations, and it does
seem better than subject.
Not sure anymore how I ended up with subject, it's grammatically off and not
very descriptive to boot.
- Renamed "PgAioHandleRef" and related functions to
PgAioWaitRef/pgaio_wref_*(), that seems a lot more descriptive.
- Renamed pgaio_io_get() to pgaio_io_acquire()
- Renamed the IO handle states (PREPARED to STAGED, IN_FLIGHT to SUBMITTED,
REAPED to COMPLETED_IO).
Particularly the various COMPLETED state names aren't necessarily final,
I've been debating a bunch of variations with Thomas and Robert
- Renamed aio_ref.h to aio_types.h, moved a few more types into it.
- Renamed completion callbacks to not use "shared" anymore - ->prepare was not
really shared and now local callbacks are back (in a restricted form).
s/PgAioHandleSharedCallback/PgAioHandleCallback/
s/pgaio_io_add_shared_cb/pgaio_io_register_callbacks/
Not entirely sure *register_callbacks is the best, happy to adjust.
- Renamed the ->error IO handle callback to ->report
Also renamed s/pgaio_result_log/pgaio_result_report/g
- Renamed the ->prepare IO handle callback to ->stage
- Partially addressed request to reorder aio/README.md
- Determine shared memory allocation size with PG_IOV_MAX not io_combine_limit
io_combine_limit is USERSET, so it's not correct to use it for shmem
allocations. I chose PG_IOV_MAX instead of MAX_IO_COMBINE_LIMIT because this
is a more generic limit than bufmgr.c IO.
- Prefix PgAio* enums with PGAIO_, global variables with pgaio_*
- Split out callback-related code from aio_subject.c (now aio_target.c) into
aio_callback.c. The target specific code is rather small, so this makes a
lot more sense.
- Distributed functions into more appropriate .c files, documented the choice
in aio.h, reordered them
- Disowned lwlock: More consistent naming, reduce diff size, resume interrupts
Heikki asked to clear ->owner when disowning the lock - but as we currently
*never* clear it, it doesn't seem right to do so only when disowning the lock.
- IO data that can be set on a handle (to e.g. transport an array of Buffers
to the completion callbacks) is now done with
pgaio_io_(get|set)_handle_data(). Mainly to distinguish it from data that's
actually the target/source of a read/write.
Heikki suggested to make this per-callback data, but I don't think there's
currently a use case for that, and it'd add a fair bit of memory overhead. I
added a comment documenting this.
- Lots of other cleanups, added comments and the like
Todo:
- Reorder README further
- Make per backend state not indexed by ProcNumber, as that requires reserving
per-backend state for IO workers, which will never need them
- Clean up localbuf.c "preparation" patches
- Add more tests - I had hoped to get to this, but got sidetracked with a
bunch of things I found while testing
- I started looking into having a distinct type for the public pgaio_io_*
related functions that can be used just by the issuer of the IO. It does
make things a bit easier to understand, but also complicates naming. Not
sure if it's worth it yet.
- Need to define (and test) the behavior when an IO worker fails to reopen the
file for an IO
- Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
such. I don't love that name though.
- There's some duplicated code in aio_callback.c, it'd be nice to deduplicate
the callback invocation of the different callbacks
- Local callbacks are triggered from within pgaio_io_reclaim(), that's not
exactly pretty. But it's currently the most central place to deal with the
case of IOs for which the shared completion callback was called in another
backend.
- As Jakub suggested (below [3]), when io_method=io_uring is used, we can run
out of file descriptors much more easily. At the very least we need a good
error message, perhaps also some rlimit adjusting (probably as a second
step, if so).
- Thomas is working on the read_stream.c <-> bufmgr.c integration piece
- Start to write docs adjustments
[1]: /messages/by-id/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
[2]: /messages/by-id/j6hny5ivrfqw356ugoy3ti5ccadamluekxod4k6amao5snew6c@t5h3bwhrgfqx
[3]: /messages/by-id/tp63m6tcbi7mmsjlqgxd55sghhwvjxp3mkgeljffkbaujezvdl@fvmdr3c6uhat
Attachments:
v2.3-0024-bufmgr-Implement-AIO-write-support.patch
From 98ba93250f1fd40e4a97387bf08f90b28686705c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:09:51 -0500
Subject: [PATCH v2.3 24/30] bufmgr: Implement AIO write support
As of this commit there are no users of these AIO facilities, that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 2 +
src/include/storage/bufmgr.h | 2 +
src/backend/storage/aio/aio_callback.c | 2 +
src/backend/storage/buffer/bufmgr.c | 90 +++++++++++++++++++++++++-
4 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 30b08495f3d..7bdce41121e 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -180,8 +180,10 @@ typedef enum PgAioHandleCallbackID
PGAIO_HCB_MD_WRITEV,
PGAIO_HCB_SHARED_BUFFER_READV,
+ PGAIO_HCB_SHARED_BUFFER_WRITEV,
PGAIO_HCB_LOCAL_BUFFER_READV,
+ PGAIO_HCB_LOCAL_BUFFER_WRITEV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f205643c4ef..cf9d0a63aed 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -203,7 +203,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
struct PgAioHandleCallbacks;
extern const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_shared_buffer_writev_cb;
extern const struct PgAioHandleCallbacks aio_local_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_local_buffer_writev_cb;
/* upper limit for effective_io_concurrency */
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 6054f57eb23..acfed50bfeb 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -45,8 +45,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_WRITEV, aio_shared_buffer_writev_cb),
CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_WRITEV, aio_local_buffer_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 118a6e1ca31..d5212da4912 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6402,6 +6402,42 @@ ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
return buf_failed;
}
+static uint64
+ReadBufferCompleteWriteShared(Buffer buffer, bool release_lock, bool failed)
+{
+ BufferDesc *bufHdr;
+ bool result = false;
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(buf_state & BM_DIRTY);
+ }
+#endif
+
+ TerminateBufferIO(bufHdr, /* clear_dirty = */ true,
+ failed ? BM_IO_ERROR : 0,
+ /* forget_owner = */ false,
+ /* syncio = */ false);
+
+ /*
+ * The initiator of IO is not managing the lock (i.e. called
+ * LWLockDisown()), we are.
+ */
+ if (release_lock)
+ LWLockReleaseDisowned(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+
+ return result;
+}
+
/*
* Helper to prepare IO on shared buffers for execution, shared between reads
* and writes.
@@ -6466,7 +6502,6 @@ shared_buffer_stage_common(PgAioHandle *ioh, bool is_write)
* Lock is now owned by AIO subsystem.
*/
LWLockDisown(content_lock);
- RESUME_INTERRUPTS();
}
/*
@@ -6483,6 +6518,12 @@ shared_buffer_readv_stage(PgAioHandle *ioh)
shared_buffer_stage_common(ioh, false);
}
+static void
+shared_buffer_writev_stage(PgAioHandle *ioh)
+{
+ shared_buffer_stage_common(ioh, true);
+}
+
static PgAioResult
shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
{
@@ -6558,6 +6599,36 @@ buffer_readv_report(PgAioResult result, const PgAioTargetData *target_data, int
MemoryContextSwitchTo(oldContext);
}
+static PgAioResult
+shared_buffer_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ /* FIXME: handle outright errors */
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+
+ /* FIXME: handle short writes / failures */
+ /* FIXME: ioh->target_data.shared_buffer.release_lock */
+ ReadBufferCompleteWriteShared(buf,
+ true,
+ false);
+
+ }
+
+ return result;
+}
+
/*
* Helper to stage IO on local buffers for execution, shared between reads
* and writes.
@@ -6644,12 +6715,26 @@ local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
return result;
}
+static void
+local_buffer_writev_stage(PgAioHandle *ioh)
+{
+ /*
+ * Currently this is unreachable as the only write support is for
+ * checkpointer / bgwriter, which don't deal with local buffers.
+ */
+ elog(ERROR, "not yet");
+}
+
const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb = {
.stage = shared_buffer_readv_stage,
.complete_shared = shared_buffer_readv_complete,
.report = buffer_readv_report,
};
+const struct PgAioHandleCallbacks aio_shared_buffer_writev_cb = {
+ .stage = shared_buffer_writev_stage,
+ .complete_shared = shared_buffer_writev_complete,
+};
const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
.stage = local_buffer_readv_stage,
@@ -6662,3 +6747,6 @@ const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
.complete_local = local_buffer_readv_complete,
.report = buffer_readv_report,
};
+const struct PgAioHandleCallbacks aio_local_buffer_writev_cb = {
+ .stage = local_buffer_writev_stage,
+};
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0025-aio-Add-IO-queue-helper.patch (text/x-diff)
From e7e8e954a1432f531d830242fd170564f268521c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:50 -0500
Subject: [PATCH v2.3 25/30] aio: Add IO queue helper
This is likely never going anywhere - Thomas Munro is working on something
more complete. But I needed a way to exercise AIO for checkpointer / bgwriter.
---
src/include/storage/io_queue.h | 31 +++++
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/io_queue.c | 198 ++++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 233 insertions(+)
create mode 100644 src/include/storage/io_queue.h
create mode 100644 src/backend/storage/aio/io_queue.c
diff --git a/src/include/storage/io_queue.h b/src/include/storage/io_queue.h
new file mode 100644
index 00000000000..f5e1bc07ff3
--- /dev/null
+++ b/src/include/storage/io_queue.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.h
+ * Mechanism for tracking many IOs
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_queue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_QUEUE_H
+#define IO_QUEUE_H
+
+struct IOQueue;
+typedef struct IOQueue IOQueue;
+
+struct PgAioWaitRef;
+
+extern IOQueue *io_queue_create(int depth, int flags);
+extern void io_queue_track(IOQueue *ioq, const struct PgAioWaitRef *iow);
+extern void io_queue_wait_one(IOQueue *ioq);
+extern void io_queue_wait_all(IOQueue *ioq);
+extern bool io_queue_is_empty(IOQueue *ioq);
+extern void io_queue_reserve(IOQueue *ioq);
+extern struct PgAioHandle *io_queue_acquire_io(IOQueue *ioq);
+extern void io_queue_free(IOQueue *ioq);
+
+#endif /* IO_QUEUE_H */
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 3f2469cc399..86fa4276fda 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -15,6 +15,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_target.o \
+ io_queue.o \
method_io_uring.o \
method_sync.o \
method_worker.o \
diff --git a/src/backend/storage/aio/io_queue.c b/src/backend/storage/aio/io_queue.c
new file mode 100644
index 00000000000..62ad06c8bfe
--- /dev/null
+++ b/src/backend/storage/aio/io_queue.c
@@ -0,0 +1,198 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_queue.c
+ * AIO - Mechanism for tracking many IOs
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/io_queue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+#include "storage/aio.h"
+#include "storage/io_queue.h"
+#include "utils/resowner.h"
+
+
+
+typedef struct TrackedIO
+{
+ PgAioWaitRef iow;
+ dlist_node node;
+} TrackedIO;
+
+struct IOQueue
+{
+ int depth;
+ int unsubmitted;
+
+ bool has_reserved;
+
+ dclist_head idle;
+ dclist_head in_progress;
+
+ TrackedIO tracked_ios[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
+IOQueue *
+io_queue_create(int depth, int flags)
+{
+ size_t sz;
+ IOQueue *ioq;
+
+ sz = offsetof(IOQueue, tracked_ios)
+ + sizeof(TrackedIO) * depth;
+
+ ioq = palloc0(sz);
+
+ ioq->depth = 0;
+
+ for (int i = 0; i < depth; i++)
+ {
+ TrackedIO *tio = &ioq->tracked_ios[i];
+
+ pgaio_wref_clear(&tio->iow);
+ dclist_push_tail(&ioq->idle, &tio->node);
+ }
+
+ return ioq;
+}
+
+void
+io_queue_wait_one(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* FIXME: Should we really pop here already? */
+ dlist_node *node = dclist_pop_head_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ pgaio_wref_wait(&tio->iow);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+void
+io_queue_reserve(IOQueue *ioq)
+{
+ if (ioq->has_reserved)
+ return;
+
+ if (dclist_is_empty(&ioq->idle))
+ io_queue_wait_one(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ ioq->has_reserved = true;
+}
+
+PgAioHandle *
+io_queue_acquire_io(IOQueue *ioq)
+{
+ PgAioHandle *ioh;
+
+ io_queue_reserve(ioq);
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ if (!io_queue_is_empty(ioq))
+ {
+ ioh = pgaio_io_acquire_nb(CurrentResourceOwner, NULL);
+ if (ioh == NULL)
+ {
+ /*
+ * Need to wait for all IOs, blocking might not be legal in the
+ * context.
+ *
+ * XXX: This doesn't make a whole lot of sense, we're also
+ * blocking here. What was I smoking when I wrote the above?
+ */
+ io_queue_wait_all(ioq);
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ }
+ }
+ else
+ {
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ }
+
+ return ioh;
+}
+
+void
+io_queue_track(IOQueue *ioq, const struct PgAioWaitRef *iow)
+{
+ dlist_node *node;
+ TrackedIO *tio;
+
+ Assert(ioq->has_reserved);
+ ioq->has_reserved = false;
+
+ Assert(!dclist_is_empty(&ioq->idle));
+
+ node = dclist_pop_head_node(&ioq->idle);
+ tio = dclist_container(TrackedIO, node, node);
+
+ tio->iow = *iow;
+
+ dclist_push_tail(&ioq->in_progress, &tio->node);
+
+ ioq->unsubmitted++;
+
+ /*
+ * XXX: Should have some smarter logic here. We don't want to wait too
+ * long to submit, that'll mean we're more likely to block. But we also
+ * don't want to have the overhead of submitting every IO individually.
+ */
+ if (ioq->unsubmitted >= 4)
+ {
+ pgaio_submit_staged();
+ ioq->unsubmitted = 0;
+ }
+}
+
+void
+io_queue_wait_all(IOQueue *ioq)
+{
+ while (!dclist_is_empty(&ioq->in_progress))
+ {
+ /* wait for the last IO to minimize unnecessary wakeups */
+ dlist_node *node = dclist_tail_node(&ioq->in_progress);
+ TrackedIO *tio = dclist_container(TrackedIO, node, node);
+
+ if (!pgaio_wref_check_done(&tio->iow))
+ {
+ ereport(DEBUG3,
+ errmsg("io_queue_wait_all for io:%d",
+ pgaio_wref_get_id(&tio->iow)),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pgaio_wref_wait(&tio->iow);
+ }
+
+ dclist_delete_from(&ioq->in_progress, &tio->node);
+ dclist_push_head(&ioq->idle, &tio->node);
+ }
+}
+
+bool
+io_queue_is_empty(IOQueue *ioq)
+{
+ return dclist_is_empty(&ioq->in_progress);
+}
+
+void
+io_queue_free(IOQueue *ioq)
+{
+ io_queue_wait_all(ioq);
+
+ pfree(ioq);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index da6df2d3654..270c4a64428 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -7,6 +7,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_target.c',
+ 'io_queue.c',
'method_io_uring.c',
'method_sync.c',
'method_worker.c',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b3f06711e6a..91d8198af9f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1179,6 +1179,7 @@ IOContext
IOFuncSelector
IOObject
IOOp
+IOQueue
IO_STATUS_BLOCK
IPCompareMethod
ITEM
@@ -2986,6 +2987,7 @@ TocEntry
TokenAuxData
TokenizedAuthLine
TrackItem
+TrackedIO
TransApplyAction
TransInvalidationInfo
TransState
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0026-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (text/x-diff)
From 0e410933546b25b259ee8a02fa27bb7f34b3f736 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:52 -0500
Subject: [PATCH v2.3 26/30] bufmgr: use AIO in checkpointer, bgwriter
This is far from ready - just included to be able to exercise AIO writes and
get some preliminary numbers. In all likelihood this will be built on top of
work by Thomas Munro rather than on the preceding commit.
---
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 3 +-
src/include/storage/bufpage.h | 1 +
src/backend/postmaster/bgwriter.c | 20 +-
src/backend/postmaster/checkpointer.c | 12 +-
src/backend/storage/buffer/bufmgr.c | 587 +++++++++++++++++++++++---
src/backend/storage/page/bufpage.c | 10 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 581 insertions(+), 58 deletions(-)
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 2d5854e6879..517c40cd804 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -31,7 +31,8 @@ extern void BackgroundWriterMain(char *startup_data, size_t startup_data_len) pg
extern void CheckpointerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
+struct IOQueue;
+extern void CheckpointWriteDelay(struct IOQueue *ioq, int flags, double progress);
extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9f936cd6b84..aeefb1746ec 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -21,6 +21,8 @@
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
+#include "storage/io_queue.h"
+#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cf9d0a63aed..bc7ee73246e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -321,7 +321,8 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern bool BgBufferSync(struct WritebackContext *wb_context);
+struct IOQueue;
+extern bool BgBufferSync(struct IOQueue *ioq, struct WritebackContext *wb_context);
extern void LimitAdditionalPins(uint32 *additional_pins);
extern void LimitAdditionalLocalPins(uint32 *additional_pins);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index d06208b7ce6..a2bd1db92d0 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -509,5 +509,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern bool PageNeedsChecksumCopy(Page page);
#endif /* BUFPAGE_H */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3eff5dc6f0e..cf16f8bed5d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -38,10 +38,12 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
@@ -89,6 +91,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
bool prev_hibernate;
+ IOQueue *ioq;
WritebackContext wb_context;
Assert(startup_data_len == 0);
@@ -130,6 +133,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ALLOCSET_DEFAULT_SIZES);
MemoryContextSwitchTo(bgwriter_context);
+ ioq = io_queue_create(128, 0);
WritebackContextInit(&wb_context, &bgwriter_flush_after);
/*
@@ -170,6 +174,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
ConditionVariableCancelSleep();
UnlockBuffers();
ReleaseAuxProcessResources(false);
+ pgaio_at_error();
AtEOXact_Buffers(false);
AtEOXact_SMgr();
AtEOXact_Files(false);
@@ -226,12 +231,22 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
+ /*
+ * FIXME: this is theoretically racy, but I didn't want to copy
+ * HandleMainLoopInterrupts() remaining body here.
+ */
+ if (ShutdownRequestPending)
+ {
+ io_queue_wait_all(ioq);
+ io_queue_free(ioq);
+ }
+
HandleMainLoopInterrupts();
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync(&wb_context);
+ can_hibernate = BgBufferSync(ioq, &wb_context);
/* Report pending statistics to the cumulative stats system */
pgstat_report_bgwriter();
@@ -248,6 +263,9 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
smgrdestroyall();
}
+ /* finish IO before sleeping, to avoid blocking other backends */
+ io_queue_wait_all(ioq);
+
/*
* Log a new xl_running_xacts every now and then so replication can
* get into a consistent state faster (think of suboverflowed
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 767bf9f5cf8..0fb7f3b7275 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -49,9 +49,11 @@
#include "postmaster/bgwriter.h"
#include "postmaster/interrupt.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pmsignal.h"
@@ -278,6 +280,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
pgstat_report_wait_end();
UnlockBuffers();
ReleaseAuxProcessResources(false);
+ pgaio_at_error();
AtEOXact_Buffers(false);
AtEOXact_SMgr();
AtEOXact_Files(false);
@@ -762,7 +765,7 @@ ImmediateCheckpointRequested(void)
* fraction between 0.0 meaning none, and 1.0 meaning all done.
*/
void
-CheckpointWriteDelay(int flags, double progress)
+CheckpointWriteDelay(IOQueue *ioq, int flags, double progress)
{
static int absorb_counter = WRITES_PER_ABSORB;
@@ -796,6 +799,13 @@ CheckpointWriteDelay(int flags, double progress)
/* Report interim statistics to the cumulative stats system */
pgstat_report_checkpointer();
+ /*
+ * Ensure all pending IO is submitted to avoid unnecessary delays for
+ * other processes.
+ */
+ io_queue_wait_all(ioq);
+
+
/*
* This sleep used to be connected to bgwriter_delay, typically 200ms.
* That resulted in more frequent wakeups if not much work to do.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d5212da4912..1e8793d1630 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -52,6 +52,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
+#include "storage/io_queue.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -77,6 +78,7 @@
/* Bits in SyncOneBuffer's return value */
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+#define BUF_CANT_MERGE 0x04
#define RELS_BSEARCH_THRESHOLD 20
@@ -511,8 +513,6 @@ static void UnpinBuffer(BufferDesc *buf);
static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
-static int SyncOneBuffer(int buf_id, bool skip_recently_used,
- WritebackContext *wb_context);
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
@@ -530,6 +530,7 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
+
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
BlockNumber nForkBlock,
@@ -3068,6 +3069,56 @@ UnpinBufferNoOwner(BufferDesc *buf)
}
}
+typedef struct BuffersToWrite
+{
+ int nbuffers;
+ BufferTag start_at_tag;
+ uint32 max_combine;
+
+ XLogRecPtr max_lsn;
+
+ PgAioHandle *ioh;
+ PgAioWaitRef iow;
+
+ uint64 total_writes;
+
+ Buffer buffers[IOV_MAX];
+ PgAioBounceBuffer *bounce_buffers[IOV_MAX];
+ const void *data_ptrs[IOV_MAX];
+} BuffersToWrite;
+
+static int PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context);
+
+static void
+BuffersToWriteInit(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ to_write->total_writes = 0;
+ to_write->nbuffers = 0;
+ to_write->ioh = NULL;
+ pgaio_wref_clear(&to_write->iow);
+ to_write->max_lsn = InvalidXLogRecPtr;
+}
+
+static void
+BuffersToWriteEnd(BuffersToWrite *to_write)
+{
+ if (to_write->ioh != NULL)
+ {
+ pgaio_io_release(to_write->ioh);
+ to_write->ioh = NULL;
+ }
+
+ if (to_write->total_writes > 0)
+ pgaio_submit_staged();
+}
+
+
#define ST_SORT sort_checkpoint_bufferids
#define ST_ELEMENT_TYPE CkptSortItem
#define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
@@ -3099,7 +3150,10 @@ BufferSync(int flags)
binaryheap *ts_heap;
int i;
int mask = BM_DIRTY;
+ IOQueue *ioq;
WritebackContext wb_context;
+ BuffersToWrite to_write;
+ int max_combine;
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3161,7 +3215,9 @@ BufferSync(int flags)
if (num_to_scan == 0)
return; /* nothing to do */
+ ioq = io_queue_create(512, 0);
WritebackContextInit(&wb_context, &checkpoint_flush_after);
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
@@ -3269,48 +3325,91 @@ BufferSync(int flags)
*/
num_processed = 0;
num_written = 0;
+
+ BuffersToWriteInit(&to_write, ioq, &wb_context);
+
while (!binaryheap_empty(ts_heap))
{
BufferDesc *bufHdr = NULL;
CkptTsStatus *ts_stat = (CkptTsStatus *)
DatumGetPointer(binaryheap_first(ts_heap));
+ bool batch_continue = true;
- buf_id = CkptBufferIds[ts_stat->index].buf_id;
- Assert(buf_id != -1);
-
- bufHdr = GetBufferDescriptor(buf_id);
-
- num_processed++;
+ Assert(ts_stat->num_scanned <= ts_stat->num_to_scan);
/*
- * We don't need to acquire the lock here, because we're only looking
- * at a single bit. It's possible that someone else writes the buffer
- * and clears the flag right after we check, but that doesn't matter
- * since SyncOneBuffer will then do nothing. However, there is a
- * further race condition: it's conceivable that between the time we
- * examine the bit here and the time SyncOneBuffer acquires the lock,
- * someone else not only wrote the buffer but replaced it with another
- * page and dirtied it. In that improbable case, SyncOneBuffer will
- * write the buffer though we didn't need to. It doesn't seem worth
- * guarding against this, though.
+ * Collect a batch of buffers to write out from the current
+ * tablespace. That causes some imbalance between the tablespaces, but
+ * that's more than outweighed by the efficiency gain due to batching.
*/
- if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+ while (batch_continue &&
+ to_write.nbuffers < max_combine &&
+ ts_stat->num_scanned < ts_stat->num_to_scan)
{
- if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+ buf_id = CkptBufferIds[ts_stat->index].buf_id;
+ Assert(buf_id != -1);
+
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ num_processed++;
+
+ /*
+ * We don't need to acquire the lock here, because we're only
+ * looking at a single bit. It's possible that someone else writes
+ * the buffer and clears the flag right after we check, but that
+ * doesn't matter since SyncOneBuffer will then do nothing.
+ * However, there is a further race condition: it's conceivable
+ * that between the time we examine the bit here and the time
+ * SyncOneBuffer acquires the lock, someone else not only wrote
+ * the buffer but replaced it with another page and dirtied it. In
+ * that improbable case, SyncOneBuffer will write the buffer
+ * though we didn't need to. It doesn't seem worth guarding
+ * against this, though.
+ */
+ if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
- TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- PendingCheckpointerStats.buffers_written++;
- num_written++;
+ int result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+
+ if (result & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+ WriteBuffers(&to_write, ioq, &wb_context);
+
+ result = PrepareToWriteBuffer(&to_write, buf_id + 1, false,
+ ioq, &wb_context);
+ Assert(result != BUF_CANT_MERGE);
+ }
+
+ if (result & BUF_WRITTEN)
+ {
+ TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
+ PendingCheckpointerStats.buffers_written++;
+ num_written++;
+ }
+ else
+ {
+ batch_continue = false;
+ }
}
+ else
+ {
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+ }
+
+ /*
+ * Measure progress independent of actually having to flush the
+			 * buffer - otherwise writing becomes unbalanced.
+ */
+ ts_stat->progress += ts_stat->progress_slice;
+ ts_stat->num_scanned++;
+ ts_stat->index++;
}
- /*
- * Measure progress independent of actually having to flush the buffer
- * - otherwise writing become unbalanced.
- */
- ts_stat->progress += ts_stat->progress_slice;
- ts_stat->num_scanned++;
- ts_stat->index++;
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, &wb_context);
+
/* Have all the buffers from the tablespace been processed? */
if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -3328,15 +3427,23 @@ BufferSync(int flags)
*
* (This will check for barrier events even if it doesn't sleep.)
*/
- CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
+ CheckpointWriteDelay(ioq, flags, (double) num_processed / num_to_scan);
}
+ Assert(to_write.nbuffers == 0);
+ io_queue_wait_all(ioq);
+
/*
* Issue all pending flushes. Only checkpointer calls BufferSync(), so
* IOContext will always be IOCONTEXT_NORMAL.
*/
IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+ io_queue_wait_all(ioq); /* IssuePendingWritebacks might have added
+ * more */
+ io_queue_free(ioq);
+ BuffersToWriteEnd(&to_write);
+
pfree(per_ts_stat);
per_ts_stat = NULL;
binaryheap_free(ts_heap);
@@ -3362,7 +3469,7 @@ BufferSync(int flags)
* bgwriter_lru_maxpages to 0.)
*/
bool
-BgBufferSync(WritebackContext *wb_context)
+BgBufferSync(IOQueue *ioq, WritebackContext *wb_context)
{
/* info obtained from freelist.c */
int strategy_buf_id;
@@ -3405,6 +3512,9 @@ BgBufferSync(WritebackContext *wb_context)
long new_strategy_delta;
uint32 new_recent_alloc;
+ BuffersToWrite to_write;
+ int max_combine;
+
/*
* Find out where the freelist clock sweep currently is, and how many
* buffer allocations have happened since our last call.
@@ -3425,6 +3535,8 @@ BgBufferSync(WritebackContext *wb_context)
return true;
}
+ max_combine = Min(io_bounce_buffers, io_combine_limit);
+
/*
* Compute strategy_delta = how many buffers have been scanned by the
* clock sweep since last time. If first time through, assume none. Then
@@ -3581,11 +3693,25 @@ BgBufferSync(WritebackContext *wb_context)
num_written = 0;
reusable_buffers = reusable_buffers_est;
+ BuffersToWriteInit(&to_write, ioq, wb_context);
+
/* Execute the LRU scan */
while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
{
- int sync_state = SyncOneBuffer(next_to_clean, true,
- wb_context);
+ int sync_state;
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ if (sync_state & BUF_CANT_MERGE)
+ {
+ Assert(to_write.nbuffers > 0);
+
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ sync_state = PrepareToWriteBuffer(&to_write, next_to_clean + 1,
+ true, ioq, wb_context);
+ Assert(sync_state != BUF_CANT_MERGE);
+ }
if (++next_to_clean >= NBuffers)
{
@@ -3596,6 +3722,13 @@ BgBufferSync(WritebackContext *wb_context)
if (sync_state & BUF_WRITTEN)
{
+ Assert(sync_state & BUF_REUSABLE);
+
+ if (to_write.nbuffers == max_combine)
+ {
+ WriteBuffers(&to_write, ioq, wb_context);
+ }
+
reusable_buffers++;
if (++num_written >= bgwriter_lru_maxpages)
{
@@ -3607,6 +3740,11 @@ BgBufferSync(WritebackContext *wb_context)
reusable_buffers++;
}
+ if (to_write.nbuffers > 0)
+ WriteBuffers(&to_write, ioq, wb_context);
+
+ BuffersToWriteEnd(&to_write);
+
PendingBgWriterStats.buf_written_clean += num_written;
#ifdef BGW_DEBUG
@@ -3645,8 +3783,66 @@ BgBufferSync(WritebackContext *wb_context)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+static inline bool
+BufferTagsSameRel(const BufferTag *tag1, const BufferTag *tag2)
+{
+ return (tag1->spcOid == tag2->spcOid) &&
+ (tag1->dbOid == tag2->dbOid) &&
+ (tag1->relNumber == tag2->relNumber) &&
+ (tag1->forkNum == tag2->forkNum)
+ ;
+}
+
+static bool
+CanMergeWrite(BuffersToWrite *to_write, BufferDesc *cur_buf_hdr)
+{
+ BlockNumber cur_block = cur_buf_hdr->tag.blockNum;
+
+ Assert(to_write->nbuffers > 0); /* can't merge with nothing */
+ Assert(to_write->start_at_tag.relNumber != InvalidOid);
+ Assert(to_write->start_at_tag.blockNum != InvalidBlockNumber);
+
+ Assert(to_write->ioh != NULL);
+
+ /*
+ * First check if the blocknumber is one that we could actually merge,
+ * that's cheaper than checking the tablespace/db/relnumber/fork match.
+ */
+ if (to_write->start_at_tag.blockNum + to_write->nbuffers != cur_block)
+ return false;
+
+ if (!BufferTagsSameRel(&to_write->start_at_tag, &cur_buf_hdr->tag))
+ return false;
+
+ /*
+ * Need to check with smgr how large a write we're allowed to make. To
+ * reduce the overhead of the smgr check, only inquire once, when
+ * processing the first to-be-merged buffer. That avoids the overhead in
+	 * the common case of writing out buffers that are definitely not mergeable.
+ */
+ if (to_write->nbuffers == 1)
+ {
+ SMgrRelation smgr;
+
+ smgr = smgropen(BufTagGetRelFileLocator(&to_write->start_at_tag), INVALID_PROC_NUMBER);
+
+ to_write->max_combine = smgrmaxcombine(smgr,
+ to_write->start_at_tag.forkNum,
+ to_write->start_at_tag.blockNum);
+ }
+ else
+ {
+ Assert(to_write->max_combine > 0);
+ }
+
+ if (to_write->start_at_tag.blockNum + to_write->max_combine <= cur_block)
+ return false;
+
+ return true;
+}
+
/*
- * SyncOneBuffer -- process a single buffer during syncing.
+ * PrepareToWriteBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
@@ -3655,22 +3851,50 @@ BgBufferSync(WritebackContext *wb_context)
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
+ * BUF_CANT_MERGE: can't combine this write with prior writes, caller needs
+ * to issue those first
*
* (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
* after locking it, but we don't care all that much.)
*/
static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+PrepareToWriteBuffer(BuffersToWrite *to_write, Buffer buf,
+ bool skip_recently_used,
+ IOQueue *ioq, WritebackContext *wb_context)
{
- BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
int result = 0;
- uint32 buf_state;
- BufferTag tag;
+ XLogRecPtr cur_buf_lsn;
+ LWLock *content_lock;
+ bool may_block;
+
+ /*
+ * Check if this buffer can be written out together with already prepared
+ * writes. We check before we have pinned the buffer, so the buffer can be
+ * written out and replaced between this check and us pinning the buffer -
+ * we'll recheck below. The reason for the pre-check is that we don't want
+ * to pin the buffer just to find out that we can't merge the IO.
+ */
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ result |= BUF_CANT_MERGE;
+ return result;
+ }
+ }
+ else
+ {
+ to_write->start_at_tag = cur_buf_hdr->tag;
+ }
/* Make sure we can handle the pin */
ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
+ /* XXX: Should also check if we are allowed to pin one more buffer */
+
/*
* Check whether buffer needs writing.
*
@@ -3680,7 +3904,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
- buf_state = LockBufHdr(bufHdr);
+ buf_state = LockBufHdr(cur_buf_hdr);
if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
@@ -3689,40 +3913,294 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
}
else if (skip_recently_used)
{
+#if 0
+ elog(LOG, "at block %d: skip recent with nbuffers %d",
+ cur_buf_hdr->tag.blockNum, to_write->nbuffers);
+#endif
/* Caller told us not to write recently-used buffers */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
{
/* It's clean, so nothing to do */
- UnlockBufHdr(bufHdr, buf_state);
+ UnlockBufHdr(cur_buf_hdr, buf_state);
return result;
}
+ /* pin the buffer, from now on its identity can't change anymore */
+ PinBuffer_Locked(cur_buf_hdr);
+
+ /*
+ * Acquire IO, if needed, now that it's likely that we'll need to write.
+ */
+ if (to_write->ioh == NULL)
+ {
+ /* otherwise we should already have acquired a handle */
+ Assert(to_write->nbuffers == 0);
+
+ to_write->ioh = io_queue_acquire_io(ioq);
+ pgaio_io_get_wref(to_write->ioh, &to_write->iow);
+ }
+
/*
- * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
- * buffer is clean by the time we've locked it.)
+ * If we are merging, check if the buffer's identity possibly changed
+ * while we hadn't yet pinned it.
+ *
+ * XXX: It might be worth checking if we still want to write the buffer
+ * out, e.g. it could have been replaced with a buffer that doesn't have
+ * BM_CHECKPOINT_NEEDED set.
*/
- PinBuffer_Locked(bufHdr);
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+ if (to_write->nbuffers != 0)
+ {
+ if (!CanMergeWrite(to_write, cur_buf_hdr))
+ {
+ elog(LOG, "changed identity");
+ UnpinBuffer(cur_buf_hdr);
+
+ result |= BUF_CANT_MERGE;
+
+ return result;
+ }
+ }
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+
+ if (!may_block)
+ {
+ if (LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /* done */
+ }
+ else if (to_write->nbuffers == 0)
+ {
+ /*
+ * Need to wait for all prior IO to finish before blocking for
+			 * lock acquisition, to avoid the risk of a deadlock due to us
+ * waiting for another backend that is waiting for our unsubmitted
+ * IO to complete.
+ */
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
+
+ elog(DEBUG2, "at block %u: can't block, nbuffers = 0",
+ cur_buf_hdr->tag.blockNum
+ );
+
+ may_block = to_write->nbuffers == 0
+ && !pgaio_have_staged()
+ && io_queue_is_empty(ioq)
+ ;
+ Assert(may_block);
+
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
+ else
+ {
+ elog(DEBUG2, "at block %d: can't block nbuffers = %d",
+ cur_buf_hdr->tag.blockNum,
+ to_write->nbuffers);
- FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ UnpinBuffer(cur_buf_hdr);
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
- LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ return result;
+ }
+ }
+ else
+ {
+ LWLockAcquire(content_lock, LW_SHARED);
+ }
- tag = bufHdr->tag;
+ if (!may_block)
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ pgaio_submit_staged();
+ io_queue_wait_all(ioq);
- UnpinBuffer(bufHdr);
+ may_block = io_queue_is_empty(ioq) && to_write->nbuffers == 0 && !pgaio_have_staged();
+
+ if (!StartBufferIO(cur_buf_hdr, false, !may_block))
+ {
+ elog(DEBUG2, "at block %d: non-waitable StartBufferIO returns false, %d",
+ cur_buf_hdr->tag.blockNum,
+ may_block);
+
+ /*
+ * FIXME: can't tell whether this is because the buffer has
+ * been cleaned
+ */
+ if (!may_block)
+ {
+ result |= BUF_CANT_MERGE;
+ Assert(to_write->nbuffers > 0);
+ }
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ return result;
+ }
+ }
+ }
+ else
+ {
+ if (!StartBufferIO(cur_buf_hdr, false, false))
+ {
+ elog(DEBUG2, "waitable StartBufferIO returns false");
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+
+ /*
+ * FIXME: Historically we returned BUF_WRITTEN in this case, which
+ * seems wrong
+ */
+ return result;
+ }
+ }
/*
- * SyncOneBuffer() is only called by checkpointer and bgwriter, so
- * IOContext will always be IOCONTEXT_NORMAL.
+ * Run PageGetLSN while holding header lock, since we don't have the
+ * buffer locked exclusively in all cases.
*/
- ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+ buf_state = LockBufHdr(cur_buf_hdr);
+
+ cur_buf_lsn = BufferGetLSN(cur_buf_hdr);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf_state &= ~BM_JUST_DIRTIED;
+
+ UnlockBufHdr(cur_buf_hdr, buf_state);
+
+ to_write->buffers[to_write->nbuffers] = buf;
+ to_write->nbuffers++;
+
+ if (buf_state & BM_PERMANENT &&
+ (to_write->max_lsn == InvalidXLogRecPtr || to_write->max_lsn < cur_buf_lsn))
+ {
+ to_write->max_lsn = cur_buf_lsn;
+ }
+
+ result |= BUF_WRITTEN;
+
+ return result;
+}
+
+static void
+WriteBuffers(BuffersToWrite *to_write,
+ IOQueue *ioq, WritebackContext *wb_context)
+{
+ SMgrRelation smgr;
+ Buffer first_buf;
+ BufferDesc *first_buf_hdr;
+ bool needs_checksum;
+
+ Assert(to_write->nbuffers > 0 && to_write->nbuffers <= io_combine_limit);
+
+ first_buf = to_write->buffers[0];
+ first_buf_hdr = GetBufferDescriptor(first_buf - 1);
+
+ smgr = smgropen(BufTagGetRelFileLocator(&first_buf_hdr->tag), INVALID_PROC_NUMBER);
+
+ /*
+ * Force XLOG flush up to buffer's LSN. This implements the basic WAL
+ * rule that log updates must hit disk before any of the data-file changes
+ * they describe do.
+ *
+ * However, this rule does not apply to unlogged relations, which will be
+ * lost after a crash anyway. Most unlogged relation pages do not bear
+ * LSNs since we never emit WAL records for them, and therefore flushing
+ * up through the buffer LSN would be useless, but harmless. However,
+ * GiST indexes use LSNs internally to track page-splits, and therefore
+ * unlogged GiST pages bear "fake" LSNs generated by
+ * GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
+ * LSN counter could advance past the WAL insertion point; and if it did
+ * happen, attempting to flush WAL through that location would fail, with
+ * disastrous system-wide consequences. To make sure that can't happen,
+ * skip the flush if the buffer isn't permanent.
+ */
+ if (to_write->max_lsn != InvalidXLogRecPtr)
+ XLogFlush(to_write->max_lsn);
+
+ /*
+ * Now it's safe to write buffer to disk. Note that no one else should
+ * have been able to write it while we were busy with log flushing because
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+ */
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+ Block bufBlock;
+ char *bufToWrite;
+
+ bufBlock = BufHdrGetBlock(cur_buf_hdr);
+ needs_checksum = PageNeedsChecksumCopy((Page) bufBlock);
+
+ /*
+ * Update page checksum if desired. Since we have only shared lock on
+ * the buffer, other processes might be updating hint bits in it, so
+ * we must copy the page to a bounce buffer if we do checksumming.
+ */
+ if (needs_checksum)
+ {
+ PgAioBounceBuffer *bb = pgaio_bounce_buffer_get();
+
+ pgaio_io_assoc_bounce_buffer(to_write->ioh, bb);
+
+ bufToWrite = pgaio_bounce_buffer_buffer(bb);
+ memcpy(bufToWrite, bufBlock, BLCKSZ);
+ PageSetChecksumInplace((Page) bufToWrite, cur_buf_hdr->tag.blockNum);
+ }
+ else
+ {
+ bufToWrite = bufBlock;
+ }
+
+ to_write->data_ptrs[nbuf] = bufToWrite;
+ }
+
+ pgaio_io_set_handle_data_32(to_write->ioh,
+ (uint32 *) to_write->buffers,
+ to_write->nbuffers);
+ pgaio_io_register_callbacks(to_write->ioh, PGAIO_HCB_SHARED_BUFFER_WRITEV);
+
+ smgrstartwritev(to_write->ioh, smgr,
+ BufTagGetForkNum(&first_buf_hdr->tag),
+ first_buf_hdr->tag.blockNum,
+ to_write->data_ptrs,
+ to_write->nbuffers,
+ false);
+ pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+ IOOP_WRITE, 1, BLCKSZ * to_write->nbuffers);
+
+
+ for (int nbuf = 0; nbuf < to_write->nbuffers; nbuf++)
+ {
+ Buffer cur_buf = to_write->buffers[nbuf];
+ BufferDesc *cur_buf_hdr = GetBufferDescriptor(cur_buf - 1);
+
+ UnpinBuffer(cur_buf_hdr);
+ }
+
+ io_queue_track(ioq, &to_write->iow);
+ to_write->total_writes++;
- return result | BUF_WRITTEN;
+ /* clear state for next write */
+ to_write->nbuffers = 0;
+ to_write->start_at_tag.relNumber = InvalidOid;
+ to_write->start_at_tag.blockNum = InvalidBlockNumber;
+ to_write->max_combine = 0;
+ to_write->max_lsn = InvalidXLogRecPtr;
+ to_write->ioh = NULL;
+ pgaio_wref_clear(&to_write->iow);
}
/*
@@ -4088,6 +4566,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
error_context_stack = errcallback.previous;
}
+
/*
* RelationGetNumberOfBlocksInFork
* Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index a931cdba151..7fd8e7681ae 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1480,6 +1480,16 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
return true;
}
+bool
+PageNeedsChecksumCopy(Page page)
+{
+ if (PageIsNew(page))
+ return false;
+
+ /* a copy for checksumming is only needed when checksums are enabled */
+ return DataChecksumsEnabled();
+}
+
/*
* Set checksum for a page in shared buffers.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91d8198af9f..bbd08cd6b4d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -347,6 +347,7 @@ BufferManagerRelation
BufferStrategyControl
BufferTag
BufferUsage
+BuffersToWrite
BuildAccumulator
BuiltinScript
BulkInsertState
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0027-very-wip-test_aio-module.patch (text/x-diff; charset=us-ascii)
From 4507a8ca905a9272bad59198f93e01e40da87451 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:54 -0500
Subject: [PATCH v2.3 27/30] very-wip: test_aio module
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio_internal.h | 8 +
src/include/storage/buf_internals.h | 4 +
src/backend/storage/aio/aio.c | 39 ++
src/backend/storage/buffer/bufmgr.c | 3 +-
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_aio/.gitignore | 6 +
src/test/modules/test_aio/Makefile | 34 ++
src/test/modules/test_aio/expected/inject.out | 295 ++++++++++
src/test/modules/test_aio/expected/io.out | 40 ++
.../modules/test_aio/expected/ownership.out | 148 +++++
src/test/modules/test_aio/expected/prep.out | 17 +
src/test/modules/test_aio/io_uring.conf | 5 +
src/test/modules/test_aio/meson.build | 78 +++
src/test/modules/test_aio/sql/inject.sql | 84 +++
src/test/modules/test_aio/sql/io.sql | 16 +
src/test/modules/test_aio/sql/ownership.sql | 65 +++
src/test/modules/test_aio/sql/prep.sql | 9 +
src/test/modules/test_aio/sync.conf | 5 +
src/test/modules/test_aio/test_aio--1.0.sql | 99 ++++
src/test/modules/test_aio/test_aio.c | 504 ++++++++++++++++++
src/test/modules/test_aio/test_aio.control | 3 +
src/test/modules/test_aio/worker.conf | 5 +
23 files changed, 1467 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/test_aio/.gitignore
create mode 100644 src/test/modules/test_aio/Makefile
create mode 100644 src/test/modules/test_aio/expected/inject.out
create mode 100644 src/test/modules/test_aio/expected/io.out
create mode 100644 src/test/modules/test_aio/expected/ownership.out
create mode 100644 src/test/modules/test_aio/expected/prep.out
create mode 100644 src/test/modules/test_aio/io_uring.conf
create mode 100644 src/test/modules/test_aio/meson.build
create mode 100644 src/test/modules/test_aio/sql/inject.sql
create mode 100644 src/test/modules/test_aio/sql/io.sql
create mode 100644 src/test/modules/test_aio/sql/ownership.sql
create mode 100644 src/test/modules/test_aio/sql/prep.sql
create mode 100644 src/test/modules/test_aio/sync.conf
create mode 100644 src/test/modules/test_aio/test_aio--1.0.sql
create mode 100644 src/test/modules/test_aio/test_aio.c
create mode 100644 src/test/modules/test_aio/test_aio.control
create mode 100644 src/test/modules/test_aio/worker.conf
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 531532e306a..1855b57f355 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -316,6 +316,14 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
__VA_ARGS__)
+/* These functions are just for use in tests, from within injection points */
+#ifdef USE_INJECTION_POINTS
+
+extern PgAioHandle *pgaio_inj_io_get(void);
+
+#endif
+
+
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index aeefb1746ec..9939032d5f0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -423,6 +423,10 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co
extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
IOContext io_context, BufferTag *tag);
+/* solely to make it easier to write tests */
+extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
+
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 431f2c2e5af..7a873f6ffbb 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -46,6 +46,10 @@
#include "utils/resowner.h"
#include "utils/wait_event_types.h"
+#ifdef USE_INJECTION_POINTS
+#include "utils/injection_point.h"
+#endif
+
static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
static void pgaio_io_reclaim(PgAioHandle *ioh);
@@ -92,6 +96,11 @@ static const IoMethodOps *const pgaio_method_ops_table[] = {
const IoMethodOps *pgaio_method_ops;
+#ifdef USE_INJECTION_POINTS
+static PgAioHandle *pgaio_inj_cur_handle;
+#endif
+
+
/* --------------------------------------------------------------------------------
* Public Functions related to PgAioHandle
@@ -452,6 +461,19 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
+#ifdef USE_INJECTION_POINTS
+ pgaio_inj_cur_handle = ioh;
+
+ /*
+ * FIXME: This could be in a critical section - but it looks like we can't
+ * just InjectionPointLoad() at process start, as the injection point
+ * might not yet be defined.
+ */
+ InjectionPointCached("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+
+ pgaio_inj_cur_handle = NULL;
+#endif
+
pgaio_io_call_complete_shared(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
@@ -1128,3 +1150,20 @@ assign_io_method(int newval, void *extra)
pgaio_method_ops = pgaio_method_ops_table[newval];
}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Injection point support
+ * --------------------------------------------------------------------------------
+ */
+
+#ifdef USE_INJECTION_POINTS
+
+PgAioHandle *
+pgaio_inj_io_get(void)
+{
+ return pgaio_inj_cur_handle;
+}
+
+#endif
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1e8793d1630..7f6eabcb92e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -514,7 +514,6 @@ static void UnpinBufferNoOwner(BufferDesc *buf);
static void BufferSync(int flags);
static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
uint32 set_flag_bits, bool forget_owner,
bool syncio);
@@ -6184,7 +6183,7 @@ WaitIO(BufferDesc *buf)
* find out if they can perform the I/O as part of a larger operation, without
* waiting for the answer or distinguishing the reasons why not.
*/
-static bool
+bool
StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
{
uint32 buf_state;
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c0d3cf0e14b..73ff9c55687 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
libpq_pipeline \
plsample \
spgist_name_ops \
+ test_aio \
test_bloomfilter \
test_copy_callbacks \
test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 4f544a042d4..b11dd72334c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -1,5 +1,6 @@
# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+subdir('test_aio')
subdir('brin')
subdir('commit_ts')
subdir('delay_execution')
diff --git a/src/test/modules/test_aio/.gitignore b/src/test/modules/test_aio/.gitignore
new file mode 100644
index 00000000000..b4903eba657
--- /dev/null
+++ b/src/test/modules/test_aio/.gitignore
@@ -0,0 +1,6 @@
+# Generated subdirectories
+/log/
+/results/
+/output_iso/
+/tmp_check/
+/tmp_check_iso/
diff --git a/src/test/modules/test_aio/Makefile b/src/test/modules/test_aio/Makefile
new file mode 100644
index 00000000000..ae6d685835b
--- /dev/null
+++ b/src/test/modules/test_aio/Makefile
@@ -0,0 +1,34 @@
+# src/test/modules/test_aio/Makefile
+
+PGFILEDESC = "test_aio - test code for AIO"
+
+MODULE_big = test_aio
+OBJS = \
+ $(WIN32RES) \
+ test_aio.o
+
+EXTENSION = test_aio
+DATA = test_aio--1.0.sql
+
+REGRESS = prep ownership io
+
+ifeq ($(enable_injection_points),yes)
+REGRESS += inject
+endif
+
+# FIXME: the meson build runs these tests with sync, worker and - if
+# supported - io_uring; the make build runs them only once.
+
+# requires custom config
+NO_INSTALLCHECK = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_aio/expected/inject.out b/src/test/modules/test_aio/expected/inject.out
new file mode 100644
index 00000000000..e62e3718845
--- /dev/null
+++ b/src/test/modules/test_aio/expected/inject.out
@@ -0,0 +1,295 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+ count
+-------
+ 10000
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(8192 + 4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+ count
+-------
+ 10000
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(4096);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+ count
+-------
+ 1
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT inj_io_short_read_attach(0);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..2 in file base/<redacted>: read only 0 of 8192 bytes
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_attach(8192);
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 0);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 1);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_a', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+ inj_io_short_read_attach
+--------------------------
+
+(1 row)
+
+SELECT invalidate_rel_block('tbl_b', 2);
+ invalidate_rel_block
+----------------------
+
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+NOTICE: wrapped error: could not read blocks 2..3 in file base/<redacted>: Input/output error
+ redact
+--------
+ f
+(1 row)
+
+SELECT inj_io_short_read_detach();
+ inj_io_short_read_detach
+--------------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/io.out b/src/test/modules/test_aio/expected/io.out
new file mode 100644
index 00000000000..e46b582f290
--- /dev/null
+++ b/src/test/modules/test_aio/expected/io.out
@@ -0,0 +1,40 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+ count
+-------
+ 1
+(1 row)
+
+SELECT corrupt_rel_block('tbl_a', 1);
+ corrupt_rel_block
+-------------------
+
+(1 row)
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+ redact
+--------
+ t
+(1 row)
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
+NOTICE: wrapped error: invalid page in block 2 of relation base/<redacted>
+ redact
+--------
+ f
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/ownership.out b/src/test/modules/test_aio/expected/ownership.out
new file mode 100644
index 00000000000..97fdad6c629
--- /dev/null
+++ b/src/test/modules/test_aio/expected/ownership.out
@@ -0,0 +1,148 @@
+-----
+-- IO handles
+----
+-- leak warning: implicit xact
+SELECT handle_get();
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+ERROR: release in unexpected state
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+ handle_get
+------------
+
+
+(2 rows)
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+WARNING: leaked AIO handle
+ handle_get
+------------
+
+(1 row)
+
+-- normal handle use
+SELECT handle_get_release();
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT handle_get_twice();
+ERROR: API violation: Only one IO can be handed out
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+ERROR: as you command
+ handle_get_release
+--------------------
+
+(1 row)
+
+-----
+-- Bounce Buffers handles
+----
+-- leak warning: implicit xact
+SELECT bb_get();
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+ERROR: can only release handed out BB
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+WARNING: leaked AIO bounce buffer
+ bb_get
+--------
+
+(1 row)
+
+-- normal bb use
+SELECT bb_get_release();
+ bb_get_release
+----------------
+
+(1 row)
+
+-- should error out, API violation
+SELECT bb_get_twice();
+ERROR: can only hand out one BB
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
+ERROR: as you command
+ bb_get_release
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/expected/prep.out b/src/test/modules/test_aio/expected/prep.out
new file mode 100644
index 00000000000..7fad6280db5
--- /dev/null
+++ b/src/test/modules/test_aio/expected/prep.out
@@ -0,0 +1,17 @@
+CREATE EXTENSION test_aio;
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+ grow_rel
+----------
+
+(1 row)
+
+SELECT grow_rel('tbl_b', 550);
+ grow_rel
+----------
+
+(1 row)
+
diff --git a/src/test/modules/test_aio/io_uring.conf b/src/test/modules/test_aio/io_uring.conf
new file mode 100644
index 00000000000..efd7ad143ff
--- /dev/null
+++ b/src/test/modules/test_aio/io_uring.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'io_uring'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/meson.build b/src/test/modules/test_aio/meson.build
new file mode 100644
index 00000000000..a4bef0ceeb0
--- /dev/null
+++ b/src/test/modules/test_aio/meson.build
@@ -0,0 +1,78 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+test_aio_sources = files(
+ 'test_aio.c',
+)
+
+if host_system == 'windows'
+ test_aio_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_aio',
+ '--FILEDESC', 'test_aio - test code for AIO',])
+endif
+
+test_aio = shared_module('test_aio',
+ test_aio_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_aio
+
+test_install_data += files(
+ 'test_aio.control',
+ 'test_aio--1.0.sql',
+)
+
+
+testfiles = [
+ 'prep',
+ 'ownership',
+ 'io',
+]
+
+if get_option('injection_points')
+ testfiles += 'inject'
+endif
+
+
+tests += {
+ 'name': 'test_aio_sync',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('sync.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+tests += {
+ 'name': 'test_aio_worker',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('worker.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ },
+}
+
+if liburing.found()
+ tests += {
+ 'name': 'test_aio_uring',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': testfiles,
+ 'regress_args': [
+ '--temp-config', files('io_uring.conf'),
+ ],
+ # requires custom config
+ 'runningcheck': false,
+ }
+ }
+endif
diff --git a/src/test/modules/test_aio/sql/inject.sql b/src/test/modules/test_aio/sql/inject.sql
new file mode 100644
index 00000000000..1190531f5ad
--- /dev/null
+++ b/src/test/modules/test_aio/sql/inject.sql
@@ -0,0 +1,84 @@
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+
+-- injected what we'd expect
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+SELECT inj_io_short_read_detach();
+
+
+-- injected a read shorter than a single block, expecting error
+SELECT inj_io_short_read_attach(17);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to a single block, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten multi-block read to 1 1/2 blocks, should retry
+SELECT count(*) FROM tbl_b; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 0);
+SELECT invalidate_rel_block('tbl_b', 1);
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(8192 + 4096);
+-- no need to redact, no messages to client
+SELECT count(*) FROM tbl_b;
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read that block partially, we'll error out,
+-- because we assume we can read at least one block at a time.
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(4096);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- shorten single-block read to read 0 bytes, expect that to error out
+SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)'; -- for comparison
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT inj_io_short_read_attach(0);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- verify that checksum errors are detected even as part of a shortened
+-- multi-block read
+-- (tbl_a, block 1 is corrupted)
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_attach(8192);
+SELECT invalidate_rel_block('tbl_a', 0);
+SELECT invalidate_rel_block('tbl_a', 1);
+SELECT invalidate_rel_block('tbl_a', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid < '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
+
+
+-- trigger a hard error, should error out
+SELECT inj_io_short_read_attach(-errno_from_string('EIO'));
+SELECT invalidate_rel_block('tbl_b', 2);
+SELECT redact($$
+ SELECT count(*) FROM tbl_b WHERE ctid = '(2, 1)';
+$$);
+SELECT inj_io_short_read_detach();
diff --git a/src/test/modules/test_aio/sql/io.sql b/src/test/modules/test_aio/sql/io.sql
new file mode 100644
index 00000000000..a29bb4eb15d
--- /dev/null
+++ b/src/test/modules/test_aio/sql/io.sql
@@ -0,0 +1,16 @@
+SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+
+SELECT corrupt_rel_block('tbl_a', 1);
+
+-- FIXME: Should report the error
+SELECT redact($$
+ SELECT read_corrupt_rel_block('tbl_a', 1);
+$$);
+
+-- verify the error is reported
+SELECT redact($$
+ SELECT count(*) FROM tbl_a WHERE ctid = '(1, 1)';
+$$);
+SELECT redact($$
+ SELECT count(*) FROM tbl_a;
+$$);
diff --git a/src/test/modules/test_aio/sql/ownership.sql b/src/test/modules/test_aio/sql/ownership.sql
new file mode 100644
index 00000000000..63cf40c802a
--- /dev/null
+++ b/src/test/modules/test_aio/sql/ownership.sql
@@ -0,0 +1,65 @@
+-----
+-- IO handles
+----
+
+-- leak warning: implicit xact
+SELECT handle_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT handle_get(); COMMIT;
+
+-- leak warning + error: released in different command (thus resowner)
+BEGIN; SELECT handle_get(); SELECT handle_release_last(); COMMIT;
+
+-- no leak, same command
+BEGIN; SELECT handle_get() UNION ALL SELECT handle_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get(); COMMIT;
+
+-- normal handle use
+SELECT handle_get_release();
+
+-- should error out, API violation
+SELECT handle_get_twice();
+
+-- recover after error in implicit xact
+SELECT handle_get_and_error(); SELECT handle_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT handle_get_and_error(); ROLLBACK; SELECT handle_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT handle_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT handle_get_release(); ROLLBACK;
+
+
+-----
+-- Bounce buffer handles
+-----
+
+-- leak warning: implicit xact
+SELECT bb_get();
+
+-- leak warning: explicit xact
+BEGIN; SELECT bb_get(); COMMIT;
+
+-- missing leak warning: we should warn at command boundaries, not xact boundaries
+BEGIN; SELECT bb_get(); SELECT bb_release_last(); COMMIT;
+
+-- leak warning: subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get(); COMMIT;
+
+-- normal bb use
+SELECT bb_get_release();
+
+-- should error out, API violation
+SELECT bb_get_twice();
+
+-- recover after error in implicit xact
+SELECT bb_get_and_error(); SELECT bb_get_release();
+
+-- recover after error in explicit xact
+BEGIN; SELECT bb_get_and_error(); ROLLBACK; SELECT bb_get_release();
+
+-- recover after error in subtrans
+BEGIN; SAVEPOINT foo; SELECT bb_get_and_error(); ROLLBACK TO SAVEPOINT foo; SELECT bb_get_release(); ROLLBACK;
diff --git a/src/test/modules/test_aio/sql/prep.sql b/src/test/modules/test_aio/sql/prep.sql
new file mode 100644
index 00000000000..b8f225cbc98
--- /dev/null
+++ b/src/test/modules/test_aio/sql/prep.sql
@@ -0,0 +1,9 @@
+CREATE EXTENSION test_aio;
+
+CREATE TABLE tbl_a(data int not null);
+CREATE TABLE tbl_b(data int not null);
+
+INSERT INTO tbl_a SELECT generate_series(1, 10000);
+INSERT INTO tbl_b SELECT generate_series(1, 10000);
+SELECT grow_rel('tbl_a', 500);
+SELECT grow_rel('tbl_b', 550);
diff --git a/src/test/modules/test_aio/sync.conf b/src/test/modules/test_aio/sync.conf
new file mode 100644
index 00000000000..c480922d6cf
--- /dev/null
+++ b/src/test/modules/test_aio/sync.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'sync'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
new file mode 100644
index 00000000000..e3d5ce29c60
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -0,0 +1,99 @@
+/* src/test/modules/test_aio/test_aio--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_aio" to load this file. \quit
+
+
+CREATE FUNCTION errno_from_string(sym text)
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION grow_rel(rel regclass, nblocks int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_corrupt_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION invalidate_rel_block(rel regclass, blockno int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION handle_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE FUNCTION bb_get_and_error()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_twice()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_get_release()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bb_release_last()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+
+CREATE OR REPLACE FUNCTION redact(p_sql text)
+RETURNS bool
+LANGUAGE plpgsql
+AS $$
+ DECLARE
+ err_state text;
+ err_msg text;
+ BEGIN
+ EXECUTE p_sql;
+ RETURN true;
+ EXCEPTION WHEN OTHERS THEN
+ GET STACKED DIAGNOSTICS
+ err_state = RETURNED_SQLSTATE,
+ err_msg = MESSAGE_TEXT;
+ err_msg = regexp_replace(err_msg, '(file|relation) "?base/[0-9]+/[0-9]+"?', '\1 base/<redacted>');
+ RAISE NOTICE 'wrapped error: %', err_msg
+ USING ERRCODE = err_state;
+ RETURN false;
+ END;
+$$;
+
+
+CREATE FUNCTION inj_io_short_read_attach(result int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION inj_io_short_read_detach()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
new file mode 100644
index 00000000000..20d7e6dc82f
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.c
@@ -0,0 +1,504 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_aio.c
+ *		Test module for asynchronous IO (AIO).
+ *
+ * Provides SQL-callable functions to acquire and release AIO handles and
+ * bounce buffers, to corrupt and invalidate relation blocks, and to inject
+ * errors into IO completion via injection points, so that regression tests
+ * can exercise AIO error handling and resource-ownership paths.
+ *
+ * Copyright (c) 2020-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/test_aio/test_aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/injection_point.h"
+#include "utils/rel.h"
+
+
+PG_MODULE_MAGIC;
+
+
+typedef struct InjIoErrorState
+{
+ bool enabled;
+ bool result_set;
+ int result;
+} InjIoErrorState;
+
+static InjIoErrorState * inj_io_error_state;
+
+/* Shared memory init callbacks */
+static shmem_request_hook_type prev_shmem_request_hook = NULL;
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
+
+static PgAioHandle *last_handle;
+static PgAioBounceBuffer *last_bb;
+
+
+
+static void
+test_aio_shmem_request(void)
+{
+ if (prev_shmem_request_hook)
+ prev_shmem_request_hook();
+
+ RequestAddinShmemSpace(sizeof(InjIoErrorState));
+}
+
+static void
+test_aio_shmem_startup(void)
+{
+ bool found;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Create or attach to the shared memory state */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ inj_io_error_state = ShmemInitStruct("test_aio",
+ sizeof(InjIoErrorState),
+ &found);
+
+ if (!found)
+ {
+ /* First time through, so initialize. */
+ inj_io_error_state->enabled = false;
+
+#ifdef USE_INJECTION_POINTS
+ InjectionPointAttach("AIO_PROCESS_COMPLETION_BEFORE_SHARED",
+ "test_aio",
+ "inj_io_short_read",
+ NULL,
+ 0);
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+#endif
+ }
+ else
+ {
+#ifdef USE_INJECTION_POINTS
+ InjectionPointLoad("AIO_PROCESS_COMPLETION_BEFORE_SHARED");
+ elog(LOG, "injection point loaded");
+#endif
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+}
+
+void
+_PG_init(void)
+{
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /* Shared memory initialization */
+ prev_shmem_request_hook = shmem_request_hook;
+ shmem_request_hook = test_aio_shmem_request;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = test_aio_shmem_startup;
+}
+
+
+PG_FUNCTION_INFO_V1(errno_from_string);
+Datum
+errno_from_string(PG_FUNCTION_ARGS)
+{
+ const char *sym = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+ if (strcmp(sym, "EIO") == 0)
+ PG_RETURN_INT32(EIO);
+ else if (strcmp(sym, "EAGAIN") == 0)
+ PG_RETURN_INT32(EAGAIN);
+ else if (strcmp(sym, "EINTR") == 0)
+ PG_RETURN_INT32(EINTR);
+ else if (strcmp(sym, "ENOSPC") == 0)
+ PG_RETURN_INT32(ENOSPC);
+ else if (strcmp(sym, "EROFS") == 0)
+ PG_RETURN_INT32(EROFS);
+
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg_internal("%s is not a supported errno value", sym));
+ PG_RETURN_INT32(0);
+}
+
+
+PG_FUNCTION_INFO_V1(grow_rel);
+Datum
+grow_rel(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 nblocks = PG_GETARG_UINT32(1);
+ Relation rel;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ while (nblocks > 0)
+ {
+ uint32 extend_by_pages;
+
+ extend_by_pages = Min(nblocks, MAX_BUFFERS_TO_EXTEND_BY);
+
+ ExtendBufferedRelBy(BMR_REL(rel),
+ MAIN_FORKNUM,
+ NULL,
+ 0,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+
+ nblocks -= extend_by_pages;
+
+ for (uint32 i = 0; i < extend_by_pages; i++)
+ {
+ ReleaseBuffer(victim_buffers[i]);
+ }
+ }
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(corrupt_rel_block);
+Datum
+corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ MarkBufferDirty(buf);
+
+ PageInit(page, BufferGetPageSize(buf), 0);
+
+ ph = (PageHeader) page;
+ ph->pd_special = BLCKSZ + 1;
+
+ FlushOneBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(read_corrupt_rel_block);
+Datum
+read_corrupt_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ BufferDesc *buf_hdr;
+ Page page;
+ PgAioHandle *ioh;
+ PgAioWaitRef iow;
+ SMgrRelation smgr;
+ uint32 buf_state;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* read buffer without erroring out */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ page = BufferGetBlock(buf);
+
+ ioh = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_get_wref(ioh, &iow);
+
+ buf_hdr = GetBufferDescriptor(buf - 1);
+ smgr = RelationGetSmgr(rel);
+
+ /* FIXME: even if just a test, we should verify nobody else uses this */
+ buf_state = LockBufHdr(buf_hdr);
+ buf_state &= ~(BM_VALID | BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ StartBufferIO(buf_hdr, true, false);
+
+ pgaio_io_set_handle_data_32(ioh, (uint32 *) &buf, 1);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
+
+ smgrstartreadv(ioh, smgr, MAIN_FORKNUM, block,
+ (void *) &page, 1);
+
+ ReleaseBuffer(buf);
+
+ pgaio_wref_wait(&iow);
+
+ relation_close(rel, NoLock);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(invalidate_rel_block);
+Datum
+invalidate_rel_block(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ PrefetchBufferResult pr;
+ Buffer buf;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ /* this is a gross hack, but there's no good API exposed */
+ pr = PrefetchBuffer(rel, MAIN_FORKNUM, block);
+ buf = pr.recent_buffer;
+ elog(LOG, "recent: %d", buf);
+ if (BufferIsValid(buf))
+ {
+ /* if the buffer contents aren't valid, this'll return false */
+ if (ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, block, buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ FlushOneBuffer(buf);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+
+ if (!EvictUnpinnedBuffer(buf))
+ elog(ERROR, "couldn't unpin");
+ }
+ }
+
+ relation_close(rel, AccessExclusiveLock);
+
+ PG_RETURN_VOID();
+}
+
+#if 0
+PG_FUNCTION_INFO_V1(test_unsubmitted_vs_close);
+Datum
+test_unsubmitted_vs_close(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ uint32 block = PG_GETARG_UINT32(1);
+ Relation rel;
+ Buffer buf;
+ Page page;
+ PageHeader ph;
+
+ rel = relation_open(relid, AccessExclusiveLock);
+
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_ZERO_AND_LOCK, NULL);
+
+ buf = ReadBuffer(rel, block);
+ page = BufferGetPage(buf);
+
+ EvictUnpinnedBuffer(buf);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+
+ MarkBufferDirty(buf);
+ ph->pd_special = BLCKSZ + 1;
+
+ /* last_handle = pgaio_io_acquire(); */
+
+ PG_RETURN_VOID();
+}
+#endif
+
+PG_FUNCTION_INFO_V1(handle_get);
+Datum
+handle_get(PG_FUNCTION_ARGS)
+{
+ last_handle = pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_release_last);
+Datum
+handle_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_handle)
+ elog(ERROR, "no handle");
+
+ pgaio_io_release(last_handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_and_error);
+Datum
+handle_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(handle_get_twice);
+Datum
+handle_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_acquire(CurrentResourceOwner, NULL);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(handle_get_release);
+Datum
+handle_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioHandle *handle;
+
+ handle = pgaio_io_acquire(CurrentResourceOwner, NULL);
+ pgaio_io_release(handle);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get);
+Datum
+bb_get(PG_FUNCTION_ARGS)
+{
+ last_bb = pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_release_last);
+Datum
+bb_release_last(PG_FUNCTION_ARGS)
+{
+ if (!last_bb)
+ elog(ERROR, "no bb");
+
+ pgaio_bounce_buffer_release(last_bb);
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_and_error);
+Datum
+bb_get_and_error(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+
+ elog(ERROR, "as you command");
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(bb_get_twice);
+Datum
+bb_get_twice(PG_FUNCTION_ARGS)
+{
+ pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_get();
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(bb_get_release);
+Datum
+bb_get_release(PG_FUNCTION_ARGS)
+{
+ PgAioBounceBuffer *bb;
+
+ bb = pgaio_bounce_buffer_get();
+ pgaio_bounce_buffer_release(bb);
+
+ PG_RETURN_VOID();
+}
+
+#ifdef USE_INJECTION_POINTS
+extern PGDLLEXPORT void inj_io_short_read(const char *name, const void *private_data);
+
+void
+inj_io_short_read(const char *name, const void *private_data)
+{
+ PgAioHandle *ioh;
+
+ elog(LOG, "short read called: %d", inj_io_error_state->enabled);
+
+ if (inj_io_error_state->enabled)
+ {
+ ioh = pgaio_inj_io_get();
+
+ if (inj_io_error_state->result_set)
+ {
+ elog(LOG, "short read, changing result from %d to %d",
+ ioh->result, inj_io_error_state->result);
+
+ ioh->result = inj_io_error_state->result;
+ }
+ }
+}
+#endif
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_attach);
+Datum
+inj_io_short_read_attach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = true;
+ inj_io_error_state->result_set = !PG_ARGISNULL(0);
+ if (inj_io_error_state->result_set)
+ inj_io_error_state->result = PG_GETARG_INT32(0);
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(inj_io_short_read_detach);
+Datum
+inj_io_short_read_detach(PG_FUNCTION_ARGS)
+{
+#ifdef USE_INJECTION_POINTS
+ inj_io_error_state->enabled = false;
+#else
+ elog(ERROR, "injection points not supported");
+#endif
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_aio/test_aio.control b/src/test/modules/test_aio/test_aio.control
new file mode 100644
index 00000000000..cd91c3ed16b
--- /dev/null
+++ b/src/test/modules/test_aio/test_aio.control
@@ -0,0 +1,3 @@
+comment = 'Test code for AIO'
+default_version = '1.0'
+module_pathname = '$libdir/test_aio'
diff --git a/src/test/modules/test_aio/worker.conf b/src/test/modules/test_aio/worker.conf
new file mode 100644
index 00000000000..8104c201924
--- /dev/null
+++ b/src/test/modules/test_aio/worker.conf
@@ -0,0 +1,5 @@
+shared_preload_libraries=test_aio
+io_method = 'worker'
+log_min_messages = 'DEBUG3'
+log_statement=all
+restart_after_crash=false
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0023-aio-Add-bounce-buffers.patch (text/x-diff)
From 398eabe5a83ff4ed74141b7803b7f6b23d0a3bdd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 17:13:46 -0500
Subject: [PATCH v2.3 23/30] aio: Add bounce buffers
---
src/include/storage/aio.h | 19 ++
src/include/storage/aio_internal.h | 33 ++++
src/include/utils/resowner.h | 2 +
src/backend/storage/aio/README.md | 27 +++
src/backend/storage/aio/aio.c | 180 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 123 ++++++++++++
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/backend/utils/resowner/resowner.c | 25 ++-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 423 insertions(+), 2 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 6f36a0b9e4d..30b08495f3d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -247,6 +247,10 @@ typedef struct PgAioHandleCallbacks
+typedef struct PgAioBounceBuffer PgAioBounceBuffer;
+
+
+
/* AIO API */
@@ -330,6 +334,20 @@ extern bool pgaio_have_staged(void);
+/* --------------------------------------------------------------------------------
+ * Bounce Buffers
+ * --------------------------------------------------------------------------------
+ */
+
+extern PgAioBounceBuffer *pgaio_bounce_buffer_get(void);
+extern void pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb);
+extern uint32 pgaio_bounce_buffer_id(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release(PgAioBounceBuffer *bb);
+extern char *pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb);
+extern void pgaio_bounce_buffer_release_resowner(struct dlist_node *bb_node, bool on_error);
+
+
+
/* --------------------------------------------------------------------------------
* Other
* --------------------------------------------------------------------------------
@@ -345,6 +363,7 @@ extern void assign_io_method(int newval, void *extra);
/* GUCs */
extern PGDLLIMPORT int io_method;
extern PGDLLIMPORT int io_max_concurrency;
+extern PGDLLIMPORT int io_bounce_buffers;
#endif /* AIO_H */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index eff544ce621..531532e306a 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -97,6 +97,12 @@ struct PgAioHandle
*/
uint32 iovec_off;
+ /*
+ * List of bounce buffers owned by this IO handle. An index-based linked
+ * list would suffice here.
+ */
+ slist_head bounce_buffers;
+
/**
* In which list the handle is registered, depends on the state:
* - IDLE, in per-backend list
@@ -133,11 +139,23 @@ struct PgAioHandle
};
+struct PgAioBounceBuffer
+{
+ slist_node node;
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+ char *buffer;
+};
+
+
typedef struct PgAioBackend
{
/* index into PgAioCtl->io_handles */
uint32 io_handle_off;
+ /* index into PgAioCtl->bounce_buffers */
+ uint32 bounce_buffers_off;
+
/* IO Handles that currently are not used */
dclist_head idle_ios;
@@ -165,6 +183,12 @@ typedef struct PgAioBackend
* IOs being appended at the end.
*/
dclist_head in_flight_ios;
+
+ /* Bounce Buffers that currently are not used */
+ slist_head idle_bbs;
+
+ /* see handed_out_io */
+ PgAioBounceBuffer *handed_out_bb;
} PgAioBackend;
@@ -190,6 +214,15 @@ typedef struct PgAioCtl
*/
uint64 *handle_data;
+ /*
+ * Bounce buffers, used to perform AIO on data that cannot be used
+ * in place, either because it does not reside in shared memory or because
+ * we need to operate on a copy (as is e.g. the case for writes when
+ * checksums are in use).
+ */
+ uint64 bounce_buffers_count;
+ PgAioBounceBuffer *bounce_buffers;
+ char *bounce_buffers_data;
+
uint64 io_handle_count;
PgAioHandle *io_handles;
} PgAioCtl;
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index aede4bfc820..7e2ec224169 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -168,5 +168,7 @@ extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *local
struct dlist_node;
extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
+extern void ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *bb_node);
#endif /* RESOWNER_H */
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
index 1b6f9d2c40b..dacff46ad12 100644
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -412,6 +412,33 @@ shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.
+### AIO Bounce Buffers
+
+For some uses of AIO there is no convenient memory location to use as the
+source / destination of an IO. E.g. when data checksums are enabled, writes
+currently cannot be done directly from shared buffers, as a shared buffer
+lock still allows some modification, e.g. of hint bits (see
+`FlushBuffer()`). If the write were done in place, such modifications could
+cause the checksum to fail.
+
+For synchronous IO this is solved by copying the buffer to separate memory
+before computing the checksum and using that copy as the source buffer for the
+AIO.
+
+However, for AIO that is not a workable solution:
+- Instead of a single buffer, many buffers are required, as many IOs might
+  be in flight.
+- When using the [worker method](#worker), the source/target of the IO needs
+  to be in shared memory; otherwise the IO workers cannot access the memory.
+
+The AIO subsystem addresses this by providing a limited number of bounce
+buffers that can be used as the source / target of IO. A bounce buffer can
+be acquired with `pgaio_bounce_buffer_get()` and multiple bounce buffers can
+be associated with an AIO handle with `pgaio_io_assoc_bounce_buffer()`.
+
+Bounce buffers are automatically released when the IO completes.
+
+
## Helpers
Using the low-level AIO API introduces too much complexity to do so all over
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index b3b4e74c3ce..431f2c2e5af 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -55,6 +55,8 @@ static PgAioHandle *pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation
static const char *pgaio_io_state_get_name(PgAioHandleState s);
static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
+static void pgaio_bounce_buffer_wait_for_free(void);
+
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
@@ -69,6 +71,7 @@ const struct config_enum_entry io_method_options[] = {
/* GUCs */
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+int io_bounce_buffers = -1;
/* global control for AIO */
PgAioCtl *pgaio_ctl;
@@ -588,6 +591,21 @@ pgaio_io_reclaim(PgAioHandle *ioh)
}
}
+ /* reclaim all associated bounce buffers */
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ slist_mutable_iter it;
+
+ slist_foreach_modify(it, &ioh->bounce_buffers)
+ {
+ PgAioBounceBuffer *bb = slist_container(PgAioBounceBuffer, node, it.cur);
+
+ slist_delete_current(&it);
+
+ slist_push_head(&pgaio_my_backend->idle_bbs, &bb->node);
+ }
+ }
+
if (ioh->resowner)
{
ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
@@ -874,6 +892,166 @@ pgaio_have_staged(void)
+/* --------------------------------------------------------------------------------
+ * Functions primarily related to PgAioBounceBuffer
+ * --------------------------------------------------------------------------------
+ */
+
+PgAioBounceBuffer *
+pgaio_bounce_buffer_get(void)
+{
+ PgAioBounceBuffer *bb = NULL;
+ slist_node *node;
+
+ if (pgaio_my_backend->handed_out_bb != NULL)
+ elog(ERROR, "can only hand out one BB");
+
+ /*
+ * FIXME: It probably is not correct to have bounce buffers be
+ * per-backend; they use too much memory.
+ */
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ pgaio_bounce_buffer_wait_for_free();
+ }
+
+ node = slist_pop_head_node(&pgaio_my_backend->idle_bbs);
+ bb = slist_container(PgAioBounceBuffer, node, node);
+
+ pgaio_my_backend->handed_out_bb = bb;
+
+ bb->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberAioBounceBuffer(bb->resowner, &bb->resowner_node);
+
+ return bb;
+}
+
+void
+pgaio_io_assoc_bounce_buffer(PgAioHandle *ioh, PgAioBounceBuffer *bb)
+{
+ if (pgaio_my_backend->handed_out_bb != bb)
+ elog(ERROR, "can only assign handed out BB");
+ pgaio_my_backend->handed_out_bb = NULL;
+
+ /*
+ * There can be many bounce buffers assigned in case of vectorized IOs.
+ */
+ slist_push_head(&ioh->bounce_buffers, &bb->node);
+
+ /* once associated with an IO, the IO has ownership */
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+uint32
+pgaio_bounce_buffer_id(PgAioBounceBuffer *bb)
+{
+ return bb - pgaio_ctl->bounce_buffers;
+}
+
+void
+pgaio_bounce_buffer_release(PgAioBounceBuffer *bb)
+{
+ if (pgaio_my_backend->handed_out_bb != bb)
+ elog(ERROR, "can only release handed out BB");
+
+ slist_push_head(&pgaio_my_backend->idle_bbs, &bb->node);
+ pgaio_my_backend->handed_out_bb = NULL;
+
+ ResourceOwnerForgetAioBounceBuffer(bb->resowner, &bb->resowner_node);
+ bb->resowner = NULL;
+}
+
+void
+pgaio_bounce_buffer_release_resowner(dlist_node *bb_node, bool on_error)
+{
+ PgAioBounceBuffer *bb = dlist_container(PgAioBounceBuffer, resowner_node, bb_node);
+
+ Assert(bb->resowner);
+
+ if (!on_error)
+ elog(WARNING, "leaked AIO bounce buffer");
+
+ pgaio_bounce_buffer_release(bb);
+}
+
+char *
+pgaio_bounce_buffer_buffer(PgAioBounceBuffer *bb)
+{
+ return bb->buffer;
+}
+
+static void
+pgaio_bounce_buffer_wait_for_free(void)
+{
+ static uint32 lastpos = 0;
+
+ if (pgaio_my_backend->num_staged_ios > 0)
+ {
+ pgaio_debug(DEBUG2, "submitting %d, while acquiring free bb",
+ pgaio_my_backend->num_staged_ios);
+ pgaio_submit_staged();
+ }
+
+ for (uint32 i = lastpos; i < lastpos + io_max_concurrency; i++)
+ {
+ uint32 thisoff = pgaio_my_backend->io_handle_off + (i % io_max_concurrency);
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[thisoff];
+
+ switch (ioh->state)
+ {
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_HANDED_OUT:
+ continue;
+ case PGAIO_HS_DEFINED: /* should have been submitted above */
+ case PGAIO_HS_STAGED:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_SUBMITTED:
+ if (!slist_is_empty(&ioh->bounce_buffers))
+ {
+ pgaio_debug_io(DEBUG2, ioh,
+ "waiting for IO to reclaim BB with %d in flight",
+ dclist_count(&pgaio_my_backend->in_flight_ios));
+
+ /* see comment in pgaio_io_wait_for_free() about raciness */
+ pgaio_io_wait(ioh, ioh->generation);
+
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ elog(WARNING, "empty after wait");
+
+ if (!slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ }
+ break;
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* reclaim */
+ pgaio_io_reclaim(ioh);
+
+ if (!slist_is_empty(&pgaio_my_backend->idle_bbs))
+ {
+ lastpos = i;
+ return;
+ }
+ break;
+ }
+ }
+
+ /*
+ * The submission above could have caused the IO to complete at any time.
+ */
+ if (slist_is_empty(&pgaio_my_backend->idle_bbs))
+ elog(PANIC, "no more bbs");
+}
+
+
+
/* --------------------------------------------------------------------------------
* Other
* --------------------------------------------------------------------------------
@@ -904,6 +1082,7 @@ void
pgaio_at_xact_end(bool is_subxact, bool is_commit)
{
Assert(!pgaio_my_backend->handed_out_io);
+ Assert(!pgaio_my_backend->handed_out_bb);
}
/*
@@ -914,6 +1093,7 @@ void
pgaio_at_error(void)
{
Assert(!pgaio_my_backend->handed_out_io);
+ Assert(!pgaio_my_backend->handed_out_bb);
}
void
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 76fcdf64670..a4f4a0b698e 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -82,6 +82,32 @@ AioHandleDataShmemSize(void)
io_max_concurrency));
}
+static Size
+AioBounceBufferDescShmemSize(void)
+{
+ Size sz;
+
+ /* PgAioBounceBuffer itself */
+ sz = mul_size(sizeof(PgAioBounceBuffer),
+ mul_size(AioProcs(), io_bounce_buffers));
+
+ return sz;
+}
+
+static Size
+AioBounceBufferDataShmemSize(void)
+{
+ Size sz;
+
+ /* and the associated buffer */
+ sz = mul_size(BLCKSZ,
+ mul_size(io_bounce_buffers, AioProcs()));
+ /* memory for alignment */
+ sz += BLCKSZ;
+
+ return sz;
+}
+
/*
* Choose a suitable value for io_max_concurrency.
*
@@ -107,6 +133,33 @@ AioChooseMaxConccurrency(void)
return Min(max_proportional_pins, 64);
}
+/*
+ * Choose a suitable value for io_bounce_buffers.
+ *
+ * It's very unlikely to be useful to allocate more bounce buffers for each
+ * backend than the backend is allowed to pin. Additionally, bounce buffers
+ * are currently only used for writes, and it seems very uncommon for more
+ * than 10% of shared_buffers to be written out concurrently.
+ *
+ * XXX: This can quickly take up significant amounts of memory; the logic
+ * should probably be fine-tuned.
+ */
+static int
+AioChooseBounceBuffers(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = (NBuffers / 10) / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 256);
+}
+
Size
AioShmemSize(void)
{
@@ -130,11 +183,31 @@ AioShmemSize(void)
PGC_S_OVERRIDE);
}
+
+ /*
+ * If io_bounce_buffers is -1, we automatically choose a suitable value.
+ *
+ * See also comment above.
+ */
+ if (io_bounce_buffers == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseBounceBuffers());
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_bounce_buffers == -1) /* failed to apply it? */
+ SetConfigOption("io_bounce_buffers", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
sz = add_size(sz, AioCtlShmemSize());
sz = add_size(sz, AioBackendShmemSize());
sz = add_size(sz, AioHandleShmemSize());
sz = add_size(sz, AioHandleIOVShmemSize());
sz = add_size(sz, AioHandleDataShmemSize());
+ sz = add_size(sz, AioBounceBufferDescShmemSize());
+ sz = add_size(sz, AioBounceBufferDataShmemSize());
if (pgaio_method_ops->shmem_size)
sz = add_size(sz, pgaio_method_ops->shmem_size());
@@ -149,6 +222,9 @@ AioShmemInit(void)
uint32 io_handle_off = 0;
uint32 iovec_off = 0;
uint32 per_backend_iovecs = io_max_concurrency * PG_IOV_MAX;
+ uint32 bounce_buffers_off = 0;
+ uint32 per_backend_bb = io_bounce_buffers;
+ char *bounce_buffers_data;
pgaio_ctl = (PgAioCtl *)
ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
@@ -160,6 +236,7 @@ AioShmemInit(void)
pgaio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
pgaio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+ pgaio_ctl->bounce_buffers_count = AioProcs() * per_backend_bb;
pgaio_ctl->backend_state = (PgAioBackend *)
ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
@@ -172,6 +249,40 @@ AioShmemInit(void)
pgaio_ctl->handle_data = (uint64 *)
ShmemInitStruct("AioHandleData", AioHandleDataShmemSize(), &found);
+ pgaio_ctl->bounce_buffers = (PgAioBounceBuffer *)
+ ShmemInitStruct("AioBounceBufferDesc", AioBounceBufferDescShmemSize(),
+ &found);
+
+ bounce_buffers_data =
+ ShmemInitStruct("AioBounceBufferData", AioBounceBufferDataShmemSize(),
+ &found);
+ bounce_buffers_data =
+ (char *) TYPEALIGN(BLCKSZ, (uintptr_t) bounce_buffers_data);
+ pgaio_ctl->bounce_buffers_data = bounce_buffers_data;
+
+
+ /* Initialize IO handles. */
+ for (uint64 i = 0; i < pgaio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[i];
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->target = PGAIO_TID_INVALID;
+ ioh->state = PGAIO_HS_IDLE;
+
+ slist_init(&ioh->bounce_buffers);
+ }
+
+ /* Initialize Bounce Buffers. */
+ for (uint64 i = 0; i < pgaio_ctl->bounce_buffers_count; i++)
+ {
+ PgAioBounceBuffer *bb = &pgaio_ctl->bounce_buffers[i];
+
+ bb->buffer = bounce_buffers_data;
+ bounce_buffers_data += BLCKSZ;
+ }
+
+
for (int procno = 0; procno < AioProcs(); procno++)
{
PgAioBackend *bs = &pgaio_ctl->backend_state[procno];
@@ -179,9 +290,13 @@ AioShmemInit(void)
bs->io_handle_off = io_handle_off;
io_handle_off += io_max_concurrency;
+ bs->bounce_buffers_off = bounce_buffers_off;
+ bounce_buffers_off += per_backend_bb;
+
dclist_init(&bs->idle_ios);
memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
dclist_init(&bs->in_flight_ios);
+ slist_init(&bs->idle_bbs);
/* initialize per-backend IOs */
for (int i = 0; i < io_max_concurrency; i++)
@@ -203,6 +318,14 @@ AioShmemInit(void)
dclist_push_tail(&bs->idle_ios, &ioh->node);
iovec_off += PG_IOV_MAX;
}
+
+ /* initialize per-backend bounce buffers */
+ for (int i = 0; i < per_backend_bb; i++)
+ {
+ PgAioBounceBuffer *bb = &pgaio_ctl->bounce_buffers[bs->bounce_buffers_off + i];
+
+ slist_push_head(&bs->idle_bbs, &bb->node);
+ }
}
out:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a83dcc820d..57865d45124 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3234,6 +3234,19 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_bounce_buffers",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of I/O bounce buffers reserved for each backend."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &io_bounce_buffers,
+ -1, -1, 4096,
+ NULL, NULL, NULL
+ },
+
{
{"io_workers",
PGC_SIGHUP,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5005e65cee0..294d661ebf4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -853,6 +853,8 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
# (change requires restart)
+#io_bounce_buffers = -1 # -1 chooses a default based on shared_buffers
+ # (change requires restart)
#------------------------------------------------------------------------------
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e5d852b5ee6..9db3c07326c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -159,10 +159,11 @@ struct ResourceOwnerData
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
/*
- * AIO handles need be registered in critical sections and therefore
- * cannot use the normal ResoureElem mechanism.
+ * AIO handles & bounce buffers need to be registered in critical sections
+ * and therefore cannot use the normal ResourceElem mechanism.
*/
dlist_head aio_handles;
+ dlist_head aio_bounce_buffers;
};
@@ -434,6 +435,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
dlist_init(&owner->aio_handles);
+ dlist_init(&owner->aio_bounce_buffers);
return owner;
}
@@ -743,6 +745,13 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
pgaio_io_release_resowner(node, !isCommit);
}
+
+ while (!dlist_is_empty(&owner->aio_bounce_buffers))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_bounce_buffers);
+
+ pgaio_bounce_buffer_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1112,3 +1121,15 @@ ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
{
dlist_delete_from(&owner->aio_handles, ioh_node);
}
+
+void
+ResourceOwnerRememberAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_bounce_buffers, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioBounceBuffer(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_bounce_buffers, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index be2dd22f1d7..b3f06711e6a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2110,6 +2110,7 @@ PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
PgAioBackend
+PgAioBounceBuffer
PgAioCtl
PgAioHandle
PgAioHandleCallbackID
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0001-checkpointer-Request-checkpoint-via-latch-inste.patch
From 369d7d8f81f26bdaf4097c6acd09b58cc8f8d151 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v2.3 01/30] checkpointer: Request checkpoint via latch instead
of signal
The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and a tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer is already busy.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
1 file changed, 19 insertions(+), 41 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..dd2c8376c6e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
/*
* Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* tell us it's okay to shut down (via SIGUSR2).
*/
pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+ pqsignal(SIGINT, SIG_IGN);
pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
/* SIGQUIT handler was already set up by InitPostmasterChild */
pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
}
-/* --------------------------------
- * signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
- /*
- * The signaling process should have set ckpt_flags nonzero, so all we
- * need do is ensure that our main loop gets kicked out of any wait.
- */
- SetLatch(MyLatch);
-}
-
-
/* --------------------------------
* communication with backends
* --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
/*
- * Send signal to request checkpoint. It's possible that the checkpointer
- * hasn't started yet, or is in process of restarting, so we will retry a
- * few times if needed. (Actually, more than a few times, since on slow
- * or overloaded buildfarm machines, it's been observed that the
- * checkpointer can take several seconds to start.) However, if not told
- * to wait for the checkpoint to occur, we consider failure to send the
- * signal to be nonfatal and merely LOG it. The checkpointer should see
- * the request when it does start, with or without getting a signal.
+ * Set checkpointer's latch to request checkpoint. It's possible that the
+ * checkpointer hasn't started yet, so we will retry a few times if
+ * needed. (Actually, more than a few times, since on slow or overloaded
+ * buildfarm machines, it's been observed that the checkpointer can take
+ * several seconds to start.) However, if not told to wait for the
+ * checkpoint to occur, we consider failure to set the latch to be
+ * nonfatal and merely LOG it. The checkpointer should see the request
+ * when it does start, with or without the SetLatch().
*/
#define MAX_SIGNAL_TRIES 600 /* max wait 60.0 sec */
for (ntries = 0;; ntries++)
{
- if (CheckpointerShmem->checkpointer_pid == 0)
+ volatile PROC_HDR *procglobal = ProcGlobal;
+ ProcNumber checkpointerProc = procglobal->checkpointerProc;
+
+ if (checkpointerProc == INVALID_PROC_NUMBER)
{
if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
{
elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: checkpointer is not running");
- break;
- }
- }
- else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
- {
- if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
- {
- elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- "could not signal for checkpoint: %m");
+ "could not notify checkpoint: checkpointer is not running");
break;
}
}
else
- break; /* signal sent successfully */
+ {
+ SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+ /* notified successfully */
+ break;
+ }
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L); /* wait 0.1 sec, then retry */
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0002-postmaster-Don-t-open-code-TerminateChildren-in.patch
From ea4f243a510b7151f0853b8b984fc81070c618c2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:20:25 -0500
Subject: [PATCH v2.3 02/30] postmaster: Don't open-code TerminateChildren() in
HandleChildCrash()
After removing the duplication, no user of sigquit_child() remains, so
remove it.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 42 +++--------------------------
1 file changed, 4 insertions(+), 38 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..8153edc446c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -424,7 +424,6 @@ static int BackendStartup(ClientSocket *client_sock);
static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
static CAC_state canAcceptConnections(BackendType backend_type);
static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
static bool SignalChildren(int signal, BackendTypeMask targetMask);
static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
@@ -2699,32 +2698,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
/*
* Signal all other child processes to exit. The crashed process has
* already been removed from ActiveChildList.
+ *
+ * We could exclude dead-end children here, but at least when sending
+ * SIGABRT it seems better to include them.
*/
if (take_action)
- {
- dlist_iter iter;
-
- dlist_foreach(iter, &ActiveChildList)
- {
- PMChild *bp = dlist_container(PMChild, elem, iter.cur);
-
- /* We do NOT restart the syslogger */
- if (bp == SysLoggerPMChild)
- continue;
-
- if (bp == StartupPMChild)
- StartupStatus = STARTUP_SIGNALED;
-
- /*
- * This backend is still alive. Unless we did so already, tell it
- * to commit hara-kiri.
- *
- * We could exclude dead-end children here, but at least when
- * sending SIGABRT it seems better to include them.
- */
- sigquit_child(bp);
- }
- }
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;
@@ -3347,19 +3326,6 @@ signal_child(PMChild *pmchild, int signal)
#endif
}
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process. We apply send_abort_for_crash to decide which signal
- * to send. Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
- signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
/*
* Send a signal to the targeted children.
*/
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0003-postmaster-Don-t-repeatedly-transition-to-crash.patch
From ae79a4158d88ab0fbe78df9ab6ec15be3152343a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Jan 2025 23:30:46 -0500
Subject: [PATCH v2.3 03/30] postmaster: Don't repeatedly transition to
crashing state
Previously HandleChildCrash() skipped logging and signalling child exits if
already in an immediate shutdown or FatalError, but still transitioned server
state in response to a crash. That's redundant.
To make it easier to combine different paths for entering FatalError state,
only do so once.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8153edc446c..939b1b2ef82 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2676,8 +2676,6 @@ CleanupBackend(PMChild *bp,
static void
HandleChildCrash(int pid, int exitstatus, const char *procname)
{
- bool take_action;
-
/*
* We only log messages and send signals if this is the first process
* crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2685,15 +2683,13 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
* signaled children, nonzero exit status is to be expected, so don't
* clutter log.
*/
- take_action = !FatalError && Shutdown != ImmediateShutdown;
+ if (FatalError || Shutdown == ImmediateShutdown)
+ return;
- if (take_action)
- {
- LogChildExit(LOG, procname, pid, exitstatus);
- ereport(LOG,
- (errmsg("terminating any other active server processes")));
- SetQuitSignalReason(PMQUIT_FOR_CRASH);
- }
+ LogChildExit(LOG, procname, pid, exitstatus);
+ ereport(LOG,
+ (errmsg("terminating any other active server processes")));
+ SetQuitSignalReason(PMQUIT_FOR_CRASH);
/*
* Signal all other child processes to exit. The crashed process has
@@ -2702,8 +2698,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
* We could exclude dead-end children here, but at least when sending
* SIGABRT it seems better to include them.
*/
- if (take_action)
- TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+ TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
if (Shutdown != ImmediateShutdown)
FatalError = true;
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0004-postmaster-Move-code-to-switch-into-FatalError-.patch
From c816b542699fdde710bbf5a909be45ffc9b8488e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:57:12 -0500
Subject: [PATCH v2.3 04/30] postmaster: Move code to switch into FatalError
state into function
There are two places switching to FatalError mode, behaving somewhat
differently. An upcoming commit will introduce a third. That doesn't seem
like a good idea.
This commit just moves the FatalError related code from HandleChildCrash()
into its own function, a subsequent commit will evolve the state machine
change to be suitable for other callers.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 70 +++++++++++++++++++----------
1 file changed, 46 insertions(+), 24 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 939b1b2ef82..13d49eecd22 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2665,40 +2665,29 @@ CleanupBackend(PMChild *bp,
}
/*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, archiver, slot sync worker, or background worker.
- *
- * The objectives here are to clean up our local state about the child
- * process, and to signal all other remaining children to quickdie.
- *
- * The caller has already released its PMChild slot.
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
*/
static void
-HandleChildCrash(int pid, int exitstatus, const char *procname)
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
{
- /*
- * We only log messages and send signals if this is the first process
- * crash and we're not doing an immediate shutdown; otherwise, we're only
- * here to update postmaster's idea of live processes. If we have already
- * signaled children, nonzero exit status is to be expected, so don't
- * clutter log.
- */
- if (FatalError || Shutdown == ImmediateShutdown)
- return;
+ int sigtosend;
+
+ SetQuitSignalReason(reason);
- LogChildExit(LOG, procname, pid, exitstatus);
- ereport(LOG,
- (errmsg("terminating any other active server processes")));
- SetQuitSignalReason(PMQUIT_FOR_CRASH);
+ if (consider_sigabrt && send_abort_for_crash)
+ sigtosend = SIGABRT;
+ else
+ sigtosend = SIGQUIT;
/*
- * Signal all other child processes to exit. The crashed process has
- * already been removed from ActiveChildList.
+ * Signal all other child processes to exit.
*
* We could exclude dead-end children here, but at least when sending
* SIGABRT it seems better to include them.
*/
- TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT);
+ TerminateChildren(sigtosend);
if (Shutdown != ImmediateShutdown)
FatalError = true;
@@ -2719,6 +2708,39 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
AbortStartTime = time(NULL);
}
+/*
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter, autovacuum, archiver, slot sync worker, or background worker.
+ *
+ * The objectives here are to clean up our local state about the child
+ * process, and to signal all other remaining children to quickdie.
+ *
+ * The caller has already released its PMChild slot.
+ */
+static void
+HandleChildCrash(int pid, int exitstatus, const char *procname)
+{
+ /*
+ * We only log messages and send signals if this is the first process
+ * crash and we're not doing an immediate shutdown; otherwise, we're only
+ * here to update postmaster's idea of live processes. If we have already
+ * signaled children, nonzero exit status is to be expected, so don't
+ * clutter log.
+ */
+ if (FatalError || Shutdown == ImmediateShutdown)
+ return;
+
+ LogChildExit(LOG, procname, pid, exitstatus);
+ ereport(LOG,
+ (errmsg("terminating any other active server processes")));
+
+ /*
+ * Switch into error state. The crashed process has already been removed
+ * from ActiveChildList.
+ */
+ HandleFatalError(PMQUIT_FOR_CRASH, true);
+}
+
/*
* Log the death of a child process.
*/
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0005-WIP-postmaster-Commonalize-FatalError-paths.patch
From 97b4983b1443d03525b0565eb104b359a43044af Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:25:01 -0500
Subject: [PATCH v2.3 05/30] WIP: postmaster: Commonalize FatalError paths
This includes some behavioural changes:
- Previously PM_WAIT_XLOG_ARCHIVAL wasn't handled in HandleFatalError(), that
doesn't seem quite right.
- Failure to fork checkpointer now transitions through PM_WAIT_BACKENDS, like
child crashes. That's not necessarily great, but...
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 61 +++++++++++++++++++++++------
1 file changed, 49 insertions(+), 12 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 13d49eecd22..41f2bbc214c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2693,12 +2693,47 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
FatalError = true;
/* We now transit into a state of waiting for children to die */
- if (pmState == PM_RECOVERY ||
- pmState == PM_HOT_STANDBY ||
- pmState == PM_RUN ||
- pmState == PM_STOP_BACKENDS ||
- pmState == PM_WAIT_XLOG_SHUTDOWN)
- UpdatePMState(PM_WAIT_BACKENDS);
+ switch (pmState)
+ {
+ case PM_INIT:
+ /* shouldn't have any children */
+ Assert(false);
+ break;
+ case PM_STARTUP:
+ /* should have been handled in process_pm_child_exit */
+ Assert(false);
+ break;
+
+ /* wait for children to die */
+ case PM_RECOVERY:
+ case PM_HOT_STANDBY:
+ case PM_RUN:
+ case PM_STOP_BACKENDS:
+ UpdatePMState(PM_WAIT_BACKENDS);
+ break;
+
+ case PM_WAIT_BACKENDS:
+ /* there might be more backends to wait for */
+ break;
+
+ case PM_WAIT_XLOG_SHUTDOWN:
+ case PM_WAIT_XLOG_ARCHIVAL:
+
+ /*
+ * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
+ * the PM_WAIT_BACKENDS && FatalError code in
+ * PostmasterStateMachine does not have to be duplicated.
+ *
+ * XXX: This seems rather ugly, but it's not obvious if the
+ * alternative is better.
+ */
+ UpdatePMState(PM_WAIT_BACKENDS);
+ break;
+
+ case PM_WAIT_DEAD_END:
+ case PM_NO_CHILDREN:
+ break;
+ }
/*
* .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2836,6 +2871,9 @@ PostmasterStateMachine(void)
* PM_WAIT_BACKENDS, but we signal the processes first, before waiting for
* them. Treating it as a distinct pmState allows us to share this code
* across multiple shutdown code paths.
+ *
+ * Note that HandleFatalError() switches to PM_WAIT_BACKENDS even if we
+ * were, before the fatal error, in a "more advanced" state.
*/
if (pmState == PM_STOP_BACKENDS || pmState == PM_WAIT_BACKENDS)
{
@@ -2967,13 +3005,12 @@ PostmasterStateMachine(void)
* We don't consult send_abort_for_crash here, as it's
* unlikely that dumping cores would illuminate the reason
* for checkpointer fork failure.
+ *
+ * XXX: Is it worth inventing a different PMQUIT value
+ * that signals that the cluster is in a bad state,
+ * without a process having crashed?
*/
- FatalError = true;
- UpdatePMState(PM_WAIT_DEAD_END);
- ConfigurePostmasterWaitSet(false);
-
- /* Kill the walsenders and archiver too */
- SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
}
}
}
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0006-postmaster-Adjust-which-processes-we-expect-to-.patch
From 8f44b56322e97dbf7f5e8e514c8e6d3e603b73bd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 00:31:03 -0500
Subject: [PATCH v2.3 06/30] postmaster: Adjust which processes we expect to
have exited
Comments and code stated that we expect checkpointer to have been signalled in
case of immediate shutdown / fatal errors, but didn't treat archiver and
walsenders the same. That doesn't seem right.
I had started digging through the history to see where this oddity was
introduced, but it's not the fault of a single commit.
Instead treat archiver, checkpointer, and walsenders the same.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/postmaster/postmaster.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 41f2bbc214c..54801a32609 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2906,16 +2906,20 @@ PostmasterStateMachine(void)
/*
* If we are doing crash recovery or an immediate shutdown then we
- * expect the checkpointer to exit as well, otherwise not.
+ * expect archiver, checkpointer and walsender to exit as well,
+ * otherwise not.
*/
if (FatalError || Shutdown >= ImmediateShutdown)
- targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+ targetMask = btmask_add(targetMask,
+ B_CHECKPOINTER,
+ B_ARCHIVER,
+ B_WAL_SENDER);
/*
- * Walsenders and archiver will continue running; they will be
- * terminated later after writing the checkpoint record. We also let
- * dead-end children to keep running for now. The syslogger process
- * exits last.
+ * Normally walsenders and archiver will continue running; they will
+ * be terminated later after writing the checkpoint record. We also
+ * let dead-end children keep running for now. The syslogger
+ * process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2926,13 +2930,17 @@ PostmasterStateMachine(void)
BackendTypeMask remainMask = BTYPE_MASK_NONE;
remainMask = btmask_add(remainMask,
- B_WAL_SENDER,
- B_ARCHIVER,
B_DEAD_END_BACKEND,
B_LOGGER);
- /* checkpointer may or may not be in targetMask already */
- remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+ /*
+ * Archiver, checkpointer and walsender may or may not be in
+ * targetMask already.
+ */
+ remainMask = btmask_add(remainMask,
+ B_ARCHIVER,
+ B_CHECKPOINTER,
+ B_WAL_SENDER);
/* these are not real postmaster children */
remainMask = btmask_add(remainMask,
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0007-Change-shutdown-sequence-to-terminate-checkpoin.patch
From ecb9f5995b5f0b38b01c8b86168aa848c9459c83 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 14 Jan 2025 01:18:42 -0500
Subject: [PATCH v2.3 07/30] Change shutdown sequence to terminate checkpointer
last
The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. Serializing stats already
happens in checkpointer, even though walsenders can be active longer.
The only reason the current state does not actively cause problems is that
walsenders don't currently generate any stats. However, there is a patch to change
that.
Another need for this change originates in the AIO patchset, where IO
workers (which, in some edge cases, can emit stats of their own) need to run
while the shutdown checkpoint is being written.
This commit changes the shutdown sequence so checkpointer is signalled (via
SIGINT) to trigger writing the shutdown checkpoint without terminating
it. Once checkpointer has written the checkpoint, it will wait for a termination
signal (SIGUSR2, as before).
Postmaster now triggers the shutdown checkpoint via SIGINT, where we
previously did so by terminating checkpointer. Checkpointer is now terminated
after all children other than dead-end ones have been terminated, tracked
using the new PM_WAIT_CHECKPOINTER PMState.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
src/include/storage/pmsignal.h | 3 +-
src/backend/postmaster/checkpointer.c | 125 +++++++++++----
src/backend/postmaster/postmaster.c | 143 +++++++++++++-----
.../utils/activity/wait_event_names.txt | 1 +
4 files changed, 200 insertions(+), 72 deletions(-)
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
PMSIGNAL_BACKGROUND_WORKER_CHANGE, /* background worker state change */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+ PMSIGNAL_XLOG_IS_SHUTDOWN, /* ShutdownXLOG() completed */
} PMSignalReason;
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
/*
* Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index dd2c8376c6e..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
* fill WAL segments; the checkpointer itself doesn't watch for the
* condition.)
*
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0). (All backends must be
- * stopped before SIGUSR2 is issued!) Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT. After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
*
* If the checkpointer exits unexpectedly, the postmaster treats that the same
* as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
@@ -141,6 +145,7 @@ double CheckPointCompletionTarget = 0.9;
* Private state
*/
static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
/* these values are valid when ckpt_active is true: */
static pg_time_t ckpt_start_time;
@@ -159,6 +164,9 @@ static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
/*
* Main entry point for checkpointer process
@@ -188,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* tell us it's okay to shut down (via SIGUSR2).
*/
pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, SIG_IGN);
+ pqsignal(SIGINT, ReqShutdownXLOG);
pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
/* SIGQUIT handler was already set up by InitPostmasterChild */
pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* process during a normal shutdown, and since checkpointer is shut down
* very late...
*
- * Walsenders are shut down after the checkpointer, but currently don't
- * report stats. If that changes, we need a more complicated solution.
+ * While e.g. walsenders are active after the shutdown checkpoint has been
+ * written (and thus could produce more stats), checkpointer stays around
+ * after the shutdown checkpoint has been written. postmaster will only
+ * signal checkpointer to exit after all processes that could emit stats
+ * have been shut down.
*/
before_shmem_exit(pgstat_before_server_shutdown, 0);
@@ -327,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
ProcGlobal->checkpointerProc = MyProcNumber;
/*
- * Loop forever
+ * Loop until we've been asked to write shutdown checkpoint or terminate.
*/
for (;;)
{
@@ -346,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
* Process any requests or signals received recently.
*/
AbsorbSyncRequests();
+
HandleCheckpointerInterrupts();
+ if (ShutdownXLOGPending || ShutdownRequestPending)
+ break;
/*
* Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
ckpt_active = false;
- /* We may have received an interrupt during the checkpoint. */
+ /*
+ * We may have received an interrupt during the checkpoint and the
+ * latch might have been reset (e.g. in CheckpointWriteDelay).
+ */
HandleCheckpointerInterrupts();
+ if (ShutdownXLOGPending || ShutdownRequestPending)
+ break;
}
/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
cur_timeout * 1000L /* convert to ms */ ,
WAIT_EVENT_CHECKPOINTER_MAIN);
}
+
+ /*
+ * From here on, elog(ERROR) should end with exit(1), not send control
+ * back to the sigsetjmp block above.
+ */
+ ExitOnAnyError = true;
+
+ if (ShutdownXLOGPending)
+ {
+ /*
+ * Close down the database.
+ *
+ * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+ * updates the statistics, increment the checkpoint request and flush
+ * out pending statistics.
+ */
+ PendingCheckpointerStats.num_requested++;
+ ShutdownXLOG(0, 0);
+ pgstat_report_checkpointer();
+ pgstat_report_wal(true);
+
+ /*
+ * Tell postmaster that we're done.
+ */
+ SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+ }
+
+ /*
+ * Wait until we're asked to shut down. By separating the writing of the
+ * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+ * some should-be-as-late-as-possible work like writing out stats.
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ HandleCheckpointerInterrupts();
+
+ if (ShutdownRequestPending)
+ break;
+
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+ 0,
+ WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+ }
+
+ /* Normal exit from the checkpointer is here */
+ proc_exit(0); /* done */
}
/*
@@ -586,29 +655,6 @@ HandleCheckpointerInterrupts(void)
*/
UpdateSharedMemoryConfig();
}
- if (ShutdownRequestPending)
- {
- /*
- * From here on, elog(ERROR) should end with exit(1), not send control
- * back to the sigsetjmp block above
- */
- ExitOnAnyError = true;
-
- /*
- * Close down the database.
- *
- * Since ShutdownXLOG() creates restartpoint or checkpoint, and
- * updates the statistics, increment the checkpoint request and flush
- * out pending statistic.
- */
- PendingCheckpointerStats.num_requested++;
- ShutdownXLOG(0, 0);
- pgstat_report_checkpointer();
- pgstat_report_wal(true);
-
- /* Normal exit from the checkpointer is here */
- proc_exit(0); /* done */
- }
/* Perform logging of memory contexts of this process */
if (LogMemoryContextPending)
@@ -729,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
* in which case we just try to catch up as quickly as possible.
*/
if (!(flags & CHECKPOINT_IMMEDIATE) &&
+ !ShutdownXLOGPending &&
!ShutdownRequestPending &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))
@@ -857,6 +904,20 @@ IsCheckpointOnSchedule(double progress)
}
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+ ShutdownXLOGPending = true;
+ SetLatch(MyLatch);
+}
+
+
/* --------------------------------
* communication with backends
* --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54801a32609..115ad3d31d2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -334,6 +334,7 @@ typedef enum
* ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
* finish */
+ PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
} PMState;
@@ -2354,35 +2355,19 @@ process_pm_child_exit(void)
{
ReleasePostmasterChildSlot(CheckpointerPMChild);
CheckpointerPMChild = NULL;
- if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+ if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
{
/*
* OK, we saw normal exit of the checkpointer after it's been
- * told to shut down. We expect that it wrote a shutdown
- * checkpoint. (If for some reason it didn't, recovery will
- * occur on next postmaster start.)
+ * told to shut down. We know checkpointer wrote a shutdown
+ * checkpoint, otherwise we'd still be in
+ * PM_WAIT_XLOG_SHUTDOWN state.
*
- * At this point we should have no normal backend children
- * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
- * we might have dead-end children to wait for.
- *
- * If we have an archiver subprocess, tell it to do a last
- * archive cycle and quit. Likewise, if we have walsender
- * processes, tell them to send any remaining WAL and quit.
- */
- Assert(Shutdown > NoShutdown);
-
- /* Waken archiver for the last time */
- if (PgArchPMChild != NULL)
- signal_child(PgArchPMChild, SIGUSR2);
-
- /*
- * Waken walsenders for the last time. No regular backends
- * should be around anymore.
+ * At this point only dead-end children should be left.
*/
- SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
- UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+ UpdatePMState(PM_WAIT_DEAD_END);
+ ConfigurePostmasterWaitSet(false);
+ SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
}
else
{
@@ -2718,6 +2703,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
case PM_WAIT_XLOG_SHUTDOWN:
case PM_WAIT_XLOG_ARCHIVAL:
+ case PM_WAIT_CHECKPOINTER:
/*
* Note that we switch *back* to PM_WAIT_BACKENDS here. This way
@@ -2980,9 +2966,9 @@ PostmasterStateMachine(void)
SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
/*
- * We already SIGQUIT'd walsenders and the archiver, if any,
- * when we started immediate shutdown or entered FatalError
- * state.
+ * We already SIGQUIT'd archiver, checkpointer and walsenders,
+ * if any, when we started immediate shutdown or entered
+ * FatalError state.
*/
}
else
@@ -2996,10 +2982,10 @@ PostmasterStateMachine(void)
/* Start the checkpointer if not running */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
- /* And tell it to shut down */
+ /* And tell it to write the shutdown checkpoint */
if (CheckpointerPMChild != NULL)
{
- signal_child(CheckpointerPMChild, SIGUSR2);
+ signal_child(CheckpointerPMChild, SIGINT);
UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
}
else
@@ -3024,22 +3010,39 @@ PostmasterStateMachine(void)
}
}
+ /*
+ * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+ * PM_WAIT_XLOG_ARCHIVAL is in process_pm_pmsignal(), in response to
+ * PMSIGNAL_XLOG_IS_SHUTDOWN.
+ */
+
if (pmState == PM_WAIT_XLOG_ARCHIVAL)
{
/*
- * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
- * than dead-end children left. There shouldn't be any regular
- * backends left by now anyway; what we're really waiting for is
- * walsenders and archiver.
+ * PM_WAIT_XLOG_ARCHIVAL state ends when there are no children other
+ * than checkpointer and dead-end children left. There shouldn't be
+ * any regular backends left by now anyway; what we're really waiting
+ * for is for walsenders and archiver to exit.
*/
- if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
{
- UpdatePMState(PM_WAIT_DEAD_END);
- ConfigurePostmasterWaitSet(false);
- SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+ UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+ /*
+ * Now that everyone important is gone, tell checkpointer to shut
+ * down too. That allows checkpointer to perform some last bits of
+ * cleanup without other processes interfering.
+ */
+ if (CheckpointerPMChild != NULL)
+ signal_child(CheckpointerPMChild, SIGUSR2);
}
}
+ /*
+ * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+ * in process_pm_child_exit().
+ */
+
if (pmState == PM_WAIT_DEAD_END)
{
/*
@@ -3176,6 +3179,7 @@ pmstate_name(PMState state)
PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+ PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
PM_TOSTR_CASE(PM_NO_CHILDREN);
}
#undef PM_TOSTR_CASE
@@ -3593,6 +3597,8 @@ ExitPostmaster(int status)
static void
process_pm_pmsignal(void)
{
+ bool request_state_update = false;
+
pending_pm_pmsignal = false;
ereport(DEBUG2,
@@ -3704,9 +3710,67 @@ process_pm_pmsignal(void)
WalReceiverRequested = true;
}
+ if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+ {
+ /* Checkpointer completed the shutdown checkpoint */
+ if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+ {
+ /*
+ * If we have an archiver subprocess, tell it to do a last archive
+ * cycle and quit. Likewise, if we have walsender processes, tell
+ * them to send any remaining WAL and quit.
+ */
+ Assert(Shutdown > NoShutdown);
+
+ /* Waken archiver for the last time */
+ if (PgArchPMChild != NULL)
+ signal_child(PgArchPMChild, SIGUSR2);
+
+ /*
+ * Waken walsenders for the last time. No regular backends should
+ * be around anymore.
+ */
+ SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+ UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+ }
+ else if (!FatalError && Shutdown != ImmediateShutdown)
+ {
+ /*
+ * Checkpointer only ought to perform the shutdown checkpoint
+ * during shutdown. If somehow checkpointer did so in another
+ * situation, we have no choice but to crash-restart.
+ *
+ * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+ * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+ * "interrupted" by a crash or an immediate shutdown.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));
+
+ /*
+ * Doesn't seem likely to help to take send_abort_for_crash into
+ * account here.
+ */
+ HandleFatalError(PMQUIT_FOR_CRASH, false);
+ }
+
+ /*
+ * Need to run PostmasterStateMachine() to check whether we can already
+ * advance to the next state.
+ */
+ request_state_update = true;
+ }
+
/*
* Try to advance postmaster's state machine, if a child requests it.
- *
+ */
+ if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+ {
+ request_state_update = true;
+ }
+
+ /*
* Be careful about the order of this action relative to this function's
* other actions. Generally, this should be after other actions, in case
* they have effects PostmasterStateMachine would need to know about.
@@ -3714,7 +3778,7 @@ process_pm_pmsignal(void)
* cannot have any (immediate) effect on the state machine, but does
* depend on what state we're in now.
*/
- if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+ if (request_state_update)
{
PostmasterStateMachine();
}
@@ -4025,6 +4089,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
switch (pmState)
{
case PM_NO_CHILDREN:
+ case PM_WAIT_CHECKPOINTER:
case PM_WAIT_DEAD_END:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN "Waiting in main loop of autovacuum launcher process."
BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN "Waiting for checkpointer process to be terminated."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0008-Ensure-a-resowner-exists-for-all-paths-that-may.patch
From 1476ef34b2a2c36e8e1eccbf6d2ac12607b4dab7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 8 Oct 2024 14:34:38 -0400
Subject: [PATCH v2.3 08/30] Ensure a resowner exists for all paths that may
perform AIO
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
---
src/backend/bootstrap/bootstrap.c | 7 +++++++
src/backend/replication/logical/logical.c | 6 ++++++
src/backend/utils/init/postinit.c | 6 +++++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 359f58a8f95..5d41cfc6eb0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -361,8 +361,15 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
BaseInit();
bootstrap_signals();
+
+ /* need a resowner for IO during BootStrapXLOG() */
+ CreateAuxProcessResourceOwner();
+
BootStrapXLOG(bootstrap_data_checksum_version);
+ ReleaseAuxProcessResources(true);
+ CurrentResourceOwner = NULL;
+
/*
* To ensure that src/common/link-canary.c is linked into the backend, we
* must call it from somewhere. Here is as good as anywhere.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0b25efafe2b..1f8ec3daa6a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,6 +386,12 @@ CreateInitDecodingContext(const char *plugin,
slot->data.plugin = plugin_name;
SpinLockRelease(&slot->mutex);
+ if (CurrentResourceOwner == NULL)
+ {
+ Assert(am_walsender);
+ CurrentResourceOwner = AuxProcessResourceOwner;
+ }
+
if (XLogRecPtrIsInvalid(restart_lsn))
ReplicationSlotReserveWal();
else
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01bb6a410cb..b491d04de58 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -755,8 +755,12 @@ InitPostgres(const char *in_dbname, Oid dboid,
* We don't yet have an aux-process resource owner, but StartupXLOG
* and ShutdownXLOG will need one. Hence, create said resource owner
* (and register a callback to clean it up after ShutdownXLOG runs).
+ *
+ * In bootstrap mode CreateAuxProcessResourceOwner() was already
+ * called in BootstrapModeMain().
*/
- CreateAuxProcessResourceOwner();
+ if (!bootstrap)
+ CreateAuxProcessResourceOwner();
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0009-Allow-lwlocks-to-be-unowned.patch
From 552b094c4f52b4092d7998cce01908bff5ddcf8b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 5 Jan 2021 10:10:36 -0800
Subject: [PATCH v2.3 09/30] Allow lwlocks to be unowned
This is required for AIO so that a lock held during a write can be released
in another backend, which in turn is required to avoid the potential for
deadlocks.
---
src/include/storage/lwlock.h | 2 +
src/backend/storage/lmgr/lwlock.c | 108 +++++++++++++++++++++++-------
2 files changed, 85 insertions(+), 25 deletions(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 2aa46fd50da..13a7dc89980 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -129,6 +129,8 @@ extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val);
extern void LWLockReleaseAll(void);
+extern void LWLockDisown(LWLock *l);
+extern void LWLockReleaseDisowned(LWLock *l, LWLockMode mode);
extern bool LWLockHeldByMe(LWLock *lock);
extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2f558ffea14..c3d6f886e3c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1773,36 +1773,15 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
}
}
-
/*
- * LWLockRelease - release a previously acquired lock
+ * Helper function to release a lock, shared between LWLockRelease() and
+ * LWLockReleaseDisowned().
*/
-void
-LWLockRelease(LWLock *lock)
+static void
+LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
- LWLockMode mode;
uint32 oldstate;
bool check_waiters;
- int i;
-
- /*
- * Remove lock from list of locks held. Usually, but not always, it will
- * be the latest-acquired lock; so search array backwards.
- */
- for (i = num_held_lwlocks; --i >= 0;)
- if (lock == held_lwlocks[i].lock)
- break;
-
- if (i < 0)
- elog(ERROR, "lock %s is not held", T_NAME(lock));
-
- mode = held_lwlocks[i].mode;
-
- num_held_lwlocks--;
- for (; i < num_held_lwlocks; i++)
- held_lwlocks[i] = held_lwlocks[i + 1];
-
- PRINT_LWDEBUG("LWLockRelease", lock, mode);
/*
* Release my hold on lock, after that it can immediately be acquired by
@@ -1840,6 +1819,85 @@ LWLockRelease(LWLock *lock)
LOG_LWDEBUG("LWLockRelease", lock, "releasing waiters");
LWLockWakeup(lock);
}
+}
+
+void
+LWLockReleaseDisowned(LWLock *lock, LWLockMode mode)
+{
+ LWLockReleaseInternal(lock, mode);
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * This is the code that can be shared between actually releasing a lock
+ * (LWLockRelease()) and just not tracking ownership of the lock anymore
+ * without releasing the lock (LWLockDisown()).
+ *
+ * Returns the mode in which the lock was held by the current backend.
+ *
+ * NB: This does not call RESUME_INTERRUPTS(), but leaves that responsibility
+ * to the caller.
+ *
+ * NB: This will leave lock->owner pointing to the current backend (if
+ * LOCK_DEBUG is set). This is somewhat intentional, as it makes it easier to
+ * debug cases of missing wakeups during lock release.
+ */
+static inline LWLockMode
+LWLockDisownInternal(LWLock *lock)
+{
+ LWLockMode mode;
+ int i;
+
+ /*
+ * Remove lock from list of locks held. Usually, but not always, it will
+ * be the latest-acquired lock; so search array backwards.
+ */
+ for (i = num_held_lwlocks; --i >= 0;)
+ if (lock == held_lwlocks[i].lock)
+ break;
+
+ if (i < 0)
+ elog(ERROR, "lock %s is not held", T_NAME(lock));
+
+ mode = held_lwlocks[i].mode;
+
+ num_held_lwlocks--;
+ for (; i < num_held_lwlocks; i++)
+ held_lwlocks[i] = held_lwlocks[i + 1];
+
+ return mode;
+}
+
+/*
+ * Stop treating lock as held by current backend.
+ *
+ * After calling this function it's the caller's responsibility to ensure that
+ * the lock gets released (via LWLockReleaseDisowned()), even in case of an
+ * error. This is only desirable if the lock is going to be released in a
+ * different process than the process that acquired it.
+ */
+void
+LWLockDisown(LWLock *lock)
+{
+ LWLockDisownInternal(lock);
+
+ RESUME_INTERRUPTS();
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLock *lock)
+{
+ LWLockMode mode;
+
+ mode = LWLockDisownInternal(lock);
+
+ PRINT_LWDEBUG("LWLockRelease", lock, mode);
+
+ LWLockReleaseInternal(lock, mode);
/*
* Now okay to allow cancel/die interrupts.
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0010-aio-Basic-subsystem-initialization.patch
From a6f1745cefdfb932be393f0374765e60563ab23d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 10 Jun 2024 13:42:58 -0700
Subject: [PATCH v2.3 10/30] aio: Basic subsystem initialization
This is just separate to make it easier to review the tendrils into various
places.
---
src/include/storage/aio.h | 37 +++++++++++++++++++
src/include/storage/aio_init.h | 24 ++++++++++++
src/include/utils/guc.h | 1 +
src/backend/storage/aio/Makefile | 2 +
src/backend/storage/aio/aio.c | 36 ++++++++++++++++++
src/backend/storage/aio/aio_init.c | 37 +++++++++++++++++++
src/backend/storage/aio/meson.build | 2 +
src/backend/storage/ipc/ipci.c | 3 ++
src/backend/utils/init/postinit.c | 7 ++++
src/backend/utils/misc/guc_tables.c | 23 ++++++++++++
src/backend/utils/misc/postgresql.conf.sample | 11 ++++++
src/tools/pgindent/typedefs.list | 1 +
12 files changed, 184 insertions(+)
create mode 100644 src/include/storage/aio.h
create mode 100644 src/include/storage/aio_init.h
create mode 100644 src/backend/storage/aio/aio.c
create mode 100644 src/backend/storage/aio/aio_init.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
new file mode 100644
index 00000000000..0e3fadac543
--- /dev/null
+++ b/src/include/storage/aio.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.h
+ * Main AIO interface
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_H
+#define AIO_H
+
+
+
+/* Enum for io_method GUC. */
+typedef enum IoMethod
+{
+ IOMETHOD_SYNC = 0,
+} IoMethod;
+
+/* We'll default to synchronous execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+
+
+extern void assign_io_method(int newval, void *extra);
+
+
+/* GUCs */
+extern PGDLLIMPORT int io_method;
+extern PGDLLIMPORT int io_max_concurrency;
+
+
+#endif /* AIO_H */
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
new file mode 100644
index 00000000000..44151ef55bf
--- /dev/null
+++ b/src/include/storage/aio_init.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.h
+ * AIO initialization - kept separate as initialization sites don't need to
+ * know about AIO itself and AIO users don't need to know about initialization.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_init.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INIT_H
+#define AIO_INIT_H
+
+
+extern Size AioShmemSize(void);
+extern void AioShmemInit(void);
+
+extern void pgaio_init_backend(void);
+
+#endif /* AIO_INIT_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 532d6642bb4..aa859c92085 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -314,6 +314,7 @@ extern PGDLLIMPORT bool optimize_bounded_sort;
*/
extern PGDLLIMPORT const struct config_enum_entry archive_mode_options[];
extern PGDLLIMPORT const struct config_enum_entry dynamic_shared_memory_options[];
+extern PGDLLIMPORT const struct config_enum_entry io_method_options[];
extern PGDLLIMPORT const struct config_enum_entry recovery_target_action_options[];
extern PGDLLIMPORT const struct config_enum_entry wal_level_options[];
extern PGDLLIMPORT const struct config_enum_entry wal_sync_method_options[];
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 2f29a9ec4d1..eaeaeeee8e3 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -9,6 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = \
+ aio.o \
+ aio_init.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
new file mode 100644
index 00000000000..f68cbc2b3f4
--- /dev/null
+++ b/src/backend/storage/aio/aio.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio.c
+ * AIO - Core Logic
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "utils/guc.h"
+
+
+/* Options for io_method. */
+const struct config_enum_entry io_method_options[] = {
+ {"sync", IOMETHOD_SYNC, false},
+ {NULL, 0, false}
+};
+
+/* GUCs */
+int io_method = DEFAULT_IO_METHOD;
+int io_max_concurrency = -1;
+
+
+
+void
+assign_io_method(int newval, void *extra)
+{
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
new file mode 100644
index 00000000000..f7ee8270756
--- /dev/null
+++ b/src/backend/storage/aio/aio_init.c
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_init.c
+ * AIO - Subsystem Initialization
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_init.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio_init.h"
+
+
+
+Size
+AioShmemSize(void)
+{
+ Size sz = 0;
+
+ return sz;
+}
+
+void
+AioShmemInit(void)
+{
+}
+
+void
+pgaio_init_backend(void)
+{
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 8abe0eb4863..c822fd4ddf7 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -1,5 +1,7 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
backend_sources += files(
+ 'aio.c',
+ 'aio_init.c',
'read_stream.c',
)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..e11e82fc897 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -37,6 +37,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/dsm_registry.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, AioShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ AioShmemInit();
}
/*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b491d04de58..8ea50314a4e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -43,6 +43,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -626,6 +627,12 @@ BaseInit(void)
*/
pgstat_initialize();
+ /*
+ * Initialize AIO before infrastructure that might need to actually
+ * execute AIO.
+ */
+ pgaio_init_backend();
+
/* Do local initialization of storage and buffer managers */
InitSync();
smgrinit();
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 38cb9e970d5..de524eccad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -71,6 +71,7 @@
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/large_object.h"
@@ -3220,6 +3221,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_max_concurrency",
+ PGC_POSTMASTER,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IOs that may be in flight in one backend."),
+ NULL,
+ },
+ &io_max_concurrency,
+ -1, -1, 1024,
+ NULL, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
@@ -5236,6 +5249,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"io_method", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Selects the method of asynchronous I/O to use."),
+ NULL
+ },
+ &io_method,
+ DEFAULT_IO_METHOD, io_method_options,
+ NULL, assign_io_method, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 079efa1baa7..fba0ad4b624 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -843,6 +843,17 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#include = '...' # include file
+#------------------------------------------------------------------------------
+# WIP AIO GUC docs
+#------------------------------------------------------------------------------
+
+#io_method = sync # (change requires restart)
+
+#io_max_concurrency = 32 # Max number of IOs that may be in
+ # flight at the same time in one backend
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5aa5c295ae..3bec090428d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1266,6 +1266,7 @@ IntoClause
InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
+IoMethod
IpcMemoryId
IpcMemoryKey
IpcMemoryState
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0011-aio-Core-AIO-implementation.patch
From ac42f990b85ae4034f16acf9929ce28e18ec2088 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 7 Jan 2025 14:42:12 -0500
Subject: [PATCH v2.3 11/30] aio: Core AIO implementation
At this point nothing can use AIO - this commit does not include any
implementation of aio subjects / callbacks. That will come in later commits.
Todo:
- lots of cleanup
---
src/include/storage/aio.h | 301 ++++++
src/include/storage/aio_internal.h | 295 ++++++
src/include/storage/aio_types.h | 115 +++
src/include/utils/resowner.h | 5 +
src/backend/access/transam/xact.c | 9 +
src/backend/storage/aio/Makefile | 4 +
src/backend/storage/aio/aio.c | 904 ++++++++++++++++++
src/backend/storage/aio/aio_callback.c | 280 ++++++
src/backend/storage/aio/aio_init.c | 186 ++++
src/backend/storage/aio/aio_io.c | 175 ++++
src/backend/storage/aio/aio_target.c | 108 +++
src/backend/storage/aio/meson.build | 4 +
src/backend/storage/aio/method_sync.c | 47 +
.../utils/activity/wait_event_names.txt | 3 +
src/backend/utils/resowner/resowner.c | 30 +
src/tools/pgindent/typedefs.list | 21 +
16 files changed, 2487 insertions(+)
create mode 100644 src/include/storage/aio_internal.h
create mode 100644 src/include/storage/aio_types.h
create mode 100644 src/backend/storage/aio/aio_callback.c
create mode 100644 src/backend/storage/aio/aio_io.c
create mode 100644 src/backend/storage/aio/aio_target.c
create mode 100644 src/backend/storage/aio/method_sync.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 0e3fadac543..ffd382593d0 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -14,6 +14,9 @@
#ifndef AIO_H
#define AIO_H
+#include "storage/aio_types.h"
+#include "storage/procnumber.h"
+
/* Enum for io_method GUC. */
@@ -26,9 +29,307 @@ typedef enum IoMethod
#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/*
+ * Flags for an IO that can be set with pgaio_io_set_flag().
+ */
+typedef enum PgAioHandleFlags
+{
+ /*
+ * Hint that IO will be executed synchronously.
+ *
+ * This can make it a bit cheaper to execute synchronous IO via the AIO
+ * interface, to avoid needing an AIO and non-AIO version of code.
+ *
+ * Advantageous to set, if applicable, but not required for correctness.
+ */
+ PGAIO_HF_SYNCHRONOUS = 1 << 0,
+
+ /*
+ * The IO references backend local memory.
+ *
+ * This needs to be set on an IO whenever the IO references process-local
+ * memory. Some IO methods do not support executing IO that references
+ * process-local memory and thus need to fall back to executing IO
+ * synchronously for IOs with this flag set.
+ *
+ * Required for correctness.
+ */
+ PGAIO_HF_REFERENCES_LOCAL = 1 << 1,
+
+ /*
+ * IO is using buffered IO, used to control heuristics in some IO methods.
+ *
+ * Advantageous to set, if applicable, but not required for correctness.
+ */
+ PGAIO_HF_BUFFERED = 1 << 2,
+} PgAioHandleFlags;
+
+/*
+ * The IO operations supported by the AIO subsystem.
+ *
+ * This could be in aio_internal.h, as it is not publicly referenced, but
+ * PgAioOpData currently *does* need to be public, therefore keeping this
+ * public seems to make sense.
+ */
+typedef enum PgAioOp
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_OP_INVALID = 0,
+
+ PGAIO_OP_READV,
+ PGAIO_OP_WRITEV,
+
+ /*
+ * In the near term we'll need at least:
+ * - fsync / fdatasync
+ * - flush_range
+ *
+ * Eventually we'll additionally want at least:
+ * - send
+ * - recv
+ * - accept
+ */
+} PgAioOp;
+
+#define PGAIO_OP_COUNT (PGAIO_OP_WRITEV + 1)
+
+
+/*
+ * What the IO is being performed on.
+ *
+ * PgAioTargetID specific behaviour should be implemented in
+ * aio_target.c.
+ */
+typedef enum PgAioTargetID
+{
+ /* intentionally the zero value, to help catch zeroed memory etc */
+ PGAIO_TID_INVALID = 0,
+} PgAioTargetID;
+
+#define PGAIO_TID_COUNT (PGAIO_TID_INVALID + 1)
+
+
+/*
+ * Data necessary to support IO operations (see PgAioOp).
+ *
+ * NB: The FDs in here may *not* be relied upon for re-issuing
+ * requests (e.g. for partial reads/writes) - the FD might be from another
+ * process, or closed since. That's not a problem for IOs waiting to be issued
+ * only because the queue is flushed when closing an FD.
+ */
+typedef union
+{
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } read;
+
+ struct
+ {
+ int fd;
+ uint16 iov_length;
+ uint64 offset;
+ } write;
+} PgAioOpData;
+
+
+/*
+ * Information about the object that IO is executed on. Mostly callbacks that
+ * operate on PgAioTargetData.
+ */
+typedef struct PgAioTargetInfo
+{
+ void (*reopen) (PgAioHandle *ioh);
+
+ char *(*describe_identity) (const PgAioTargetData *sd);
+
+ const char *name;
+} PgAioTargetInfo;
+
+
+/*
+ * IDs for callbacks that can be registered on an IO.
+ *
+ * Callbacks are identified by an ID rather than a function pointer. There are
+ * two main reasons:
+ *
+ * 1) Memory within PgAioHandle is precious, due to the number of PgAioHandle
+ * structs in pre-allocated shared memory.
+ *
+ * 2) Due to EXEC_BACKEND function pointers are not necessarily stable between
+ * different backends, therefore function pointers cannot directly be in
+ * shared memory.
+ *
+ * Without 2), we could fairly easily allow adding new callbacks, by filling
+ * an ID->pointer mapping table on demand. In the presence of 2) that's still
+ * doable, but harder, because every process has to re-register the pointers
+ * so that a local ID->"backend local pointer" mapping can be maintained.
+ */
+typedef enum PgAioHandleCallbackID
+{
+ PGAIO_HCB_INVALID,
+} PgAioHandleCallbackID;
+
+
+typedef void (*PgAioHandleCallbackStage) (PgAioHandle *ioh);
+typedef PgAioResult (*PgAioHandleCallbackComplete) (PgAioHandle *ioh, PgAioResult prior_result);
+typedef void (*PgAioHandleCallbackReport) (PgAioResult result, const PgAioTargetData *target_data, int elevel);
+
+typedef struct PgAioHandleCallbacks
+{
+ /*
+ * Prepare resources affected by the IO for execution. This could e.g.
+ * include moving ownership of buffer pins to the AIO subsystem.
+ */
+ PgAioHandleCallbackStage stage;
+
+ /*
+ * Update the state of resources affected by the IO to reflect completion
+ * of the IO. This could e.g. include updating shared buffer state to
+ * signal the IO has finished.
+ *
+ * The _shared suffix indicates that this is executed by the backend that
+ * completed the IO, which may or may not be the backend that issued the
+ * IO. Obviously the callback thus can only modify resources in shared
+ * memory.
+ *
+ * The latest registered callback is called first. This allows
+ * higher-level code to register callbacks that can rely on callbacks
+ * registered by lower-level code to already have been executed.
+ *
+ * NB: This is called in a critical section. Errors can be signalled by
+ * the callback's return value, it's the responsibility of the IO's issuer
+ * to react appropriately.
+ */
+ PgAioHandleCallbackComplete complete_shared;
+
+ /*
+ * Like complete_shared, except called in the issuing backend.
+ *
+ * This variant of the completion callback is useful when backend-local
+ * state has to be updated to reflect the IO's completion. E.g. a
+ * temporary buffer's BufferDesc isn't accessible in complete_shared.
+ *
+ * Local callbacks are only called after complete_shared for all
+ * registered callbacks has been called.
+ */
+ PgAioHandleCallbackComplete complete_local;
+
+ /*
+ * Report the result of an IO operation. This is e.g. used to raise an
+ * error after an IO failed at the appropriate time (i.e. not when the IO
+ * failed, but under control of the code that issued the IO).
+ */
+ PgAioHandleCallbackReport report;
+} PgAioHandleCallbacks;
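
The latest-registered-first ordering of complete_shared callbacks can be sketched in isolation. The following is a minimal, self-contained illustration of the technique, not code from the patch; all names here are hypothetical:

```c
#include <assert.h>

#define DEMO_MAX_CALLBACKS 4

typedef int (*demo_complete_cb) (int prior_result);

/* two example callbacks, standing in for lower/higher-level layers */
static int demo_add_one(int r) { return r + 1; }
static int demo_double(int r) { return r * 2; }

typedef struct DemoCallbacks
{
	int			ncallbacks;
	demo_complete_cb complete[DEMO_MAX_CALLBACKS];
} DemoCallbacks;

static void
demo_register(DemoCallbacks *cbs, demo_complete_cb cb)
{
	assert(cbs->ncallbacks < DEMO_MAX_CALLBACKS);
	cbs->complete[cbs->ncallbacks++] = cb;
}

/*
 * Invoke callbacks latest-registered-first, each one translating the
 * result produced by the callback registered before it.
 */
static int
demo_complete(DemoCallbacks *cbs, int raw_result)
{
	int			result = raw_result;

	for (int i = cbs->ncallbacks - 1; i >= 0; i--)
		result = cbs->complete[i] (result);
	return result;
}
```

The layer that registers last sees the raw result first and can translate it for the layers whose callbacks run afterwards.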
+
+
+
+/*
+ * How many callbacks can be registered for one IO handle. Currently we only
+ * need two, but it's not hard to imagine needing a few more.
+ */
+#define PGAIO_HANDLE_MAX_CALLBACKS 4
+
+
+
+/* AIO API */
+
+
+/* --------------------------------------------------------------------------------
+ * IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/* functions in aio.c */
+struct ResourceOwnerData;
+extern PgAioHandle *pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+extern PgAioHandle *pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret);
+
+extern void pgaio_io_release(PgAioHandle *ioh);
+struct dlist_node;
+extern void pgaio_io_release_resowner(struct dlist_node *ioh_node, bool on_error);
+
+extern void pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag);
+
+extern int pgaio_io_get_id(PgAioHandle *ioh);
+extern ProcNumber pgaio_io_get_owner(PgAioHandle *ioh);
+
+extern void pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow);
+
+/* functions in aio_io.c */
+struct iovec;
+extern int pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov);
+
+extern PgAioOpData *pgaio_io_get_op_data(PgAioHandle *ioh);
+
+extern void pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+extern void pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset);
+
+/* functions in aio_target.c */
+extern void pgaio_io_set_target(PgAioHandle *ioh, PgAioTargetID targetid);
+extern bool pgaio_io_has_target(PgAioHandle *ioh);
+extern PgAioTargetData *pgaio_io_get_target_data(PgAioHandle *ioh);
+extern char *pgaio_io_get_target_description(PgAioHandle *ioh);
+
+/* functions in aio_callback.c */
+extern void pgaio_io_register_callbacks(PgAioHandle *ioh, PgAioHandleCallbackID cbid);
+extern void pgaio_io_set_handle_data_64(PgAioHandle *ioh, uint64 *data, uint8 len);
+extern void pgaio_io_set_handle_data_32(PgAioHandle *ioh, uint32 *data, uint8 len);
+extern uint64 *pgaio_io_get_handle_data(PgAioHandle *ioh, uint8 *len);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Wait References
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_wref_clear(PgAioWaitRef *iow);
+extern bool pgaio_wref_valid(PgAioWaitRef *iow);
+extern int pgaio_wref_get_id(PgAioWaitRef *iow);
+
+extern void pgaio_wref_wait(PgAioWaitRef *iow);
+extern bool pgaio_wref_check_done(PgAioWaitRef *iow);
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_result_report(PgAioResult result, const PgAioTargetData *target_data,
+ int elevel);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_submit_staged(void);
+extern bool pgaio_have_staged(void);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+extern void pgaio_closing_fd(int fd);
+extern void pgaio_at_xact_end(bool is_subxact, bool is_commit);
+extern void pgaio_at_error(void);
extern void assign_io_method(int newval, void *extra);
+
/* GUCs */
extern PGDLLIMPORT int io_method;
extern PGDLLIMPORT int io_max_concurrency;
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
new file mode 100644
index 00000000000..174d365f9c0
--- /dev/null
+++ b/src/include/storage/aio_internal.h
@@ -0,0 +1,295 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_internal.h
+ * AIO related declarations that should only be used by the AIO subsystem
+ * internally.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_INTERNAL_H
+#define AIO_INTERNAL_H
+
+
+#include "lib/ilist.h"
+#include "port/pg_iovec.h"
+#include "storage/aio.h"
+#include "storage/condition_variable.h"
+
+
+/* AFIXME */
+#define PGAIO_SUBMIT_BATCH_SIZE 32
+
+
+
+typedef enum PgAioHandleState
+{
+ /* not in use */
+ PGAIO_HS_IDLE = 0,
+
+ /* returned by pgaio_io_acquire() */
+ PGAIO_HS_HANDED_OUT,
+
+ /* pgaio_io_prep_*() has been called, but IO hasn't been submitted yet */
+ PGAIO_HS_DEFINED,
+
+ /* target's stage() callback has been called, ready to be submitted */
+ PGAIO_HS_STAGED,
+
+ /* IO has been submitted and is being executed */
+ PGAIO_HS_SUBMITTED,
+
+ /* IO finished, but result has not yet been processed */
+ PGAIO_HS_COMPLETED_IO,
+
+ /* IO completed, shared completion has been called */
+ PGAIO_HS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ PGAIO_HS_COMPLETED_LOCAL,
+} PgAioHandleState;
+
+
+struct ResourceOwnerData;
+
+/* typedef is in public header */
+struct PgAioHandle
+{
+ /* all state updates should go through pgaio_io_update_state() */
+ PgAioHandleState state:8;
+
+ /* what are we operating on */
+ PgAioTargetID target:8;
+
+ /* which IO operation */
+ PgAioOp op:8;
+
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
+ uint8 num_shared_callbacks;
+
+ /* using the proper type here would use more space */
+ uint8 shared_callbacks[PGAIO_HANDLE_MAX_CALLBACKS];
+
+ /*
+ * Length of data associated with handle using
+ * pgaio_io_set_handle_data_*().
+ */
+ uint8 handle_data_len;
+
+ /* XXX: could be optimized out with some pointer math */
+ int32 owner_procno;
+
+ /* raw result of the IO operation */
+ int32 result;
+
+ /*
+ * Index into PgAioCtl->iovecs and PgAioCtl->handle_data.
+ *
+ * At the moment there's no need to differentiate between the two, but
+ * that won't necessarily stay that way.
+ */
+ uint32 iovec_off;
+
+ /*
+ * Which list the handle is registered in depends on the state:
+ * - IDLE - in per-backend idle list
+ * - HANDED_OUT - not in a list
+ * - DEFINED - in per-backend staged list
+ * - STAGED - in per-backend staged list
+ * - SUBMITTED - in issuer's in_flight list
+ * - COMPLETED_IO - in issuer's in_flight list
+ * - COMPLETED_SHARED - in issuer's in_flight list
+ */
+ dlist_node node;
+
+ struct ResourceOwnerData *resowner;
+ dlist_node resowner_node;
+
+ /* incremented every time the IO handle is reused */
+ uint64 generation;
+
+ ConditionVariable cv;
+
+ /* result of shared callback, passed to issuer callback */
+ PgAioResult distilled_result;
+
+ PgAioReturn *report_return;
+
+ PgAioOpData op_data;
+
+ /*
+ * Data necessary to identify the object undergoing IO to higher-level
+ * code. Needs to be sufficient to allow another backend to reopen the
+ * file.
+ */
+ PgAioTargetData target_data;
+};
+
+
+typedef struct PgAioBackend
+{
+ /* index into PgAioCtl->io_handles */
+ uint32 io_handle_off;
+
+ /* IO Handles that currently are not used */
+ dclist_head idle_ios;
+
+ /*
+ * Only one IO may be returned by pgaio_io_acquire()/pgaio_io_acquire_nb()
+ * without having been either defined (by actually associating it with IO)
+ * or released (with pgaio_io_release()). This restriction is necessary
+ * to guarantee that we always can acquire an IO. ->handed_out_io is used
+ * to enforce that rule.
+ */
+ PgAioHandle *handed_out_io;
+
+ /*
+ * IOs that are defined, but not yet submitted.
+ */
+ uint16 num_staged_ios;
+ PgAioHandle *staged_ios[PGAIO_SUBMIT_BATCH_SIZE];
+
+ /*
+ * List of in-flight IOs. Also contains IOs that aren't strictly speaking
+ * in-flight anymore, but have been waited-for and completed by another
+ * backend. Once this backend sees such an IO it'll be reclaimed.
+ *
+ * The list is ordered by submission time, with more recently submitted
+ * IOs being appended at the end.
+ */
+ dclist_head in_flight_ios;
+} PgAioBackend;
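
The handed_out_io rule above amounts to a tiny state machine. As a self-contained sketch of that invariant (hypothetical names, not part of the patch):

```c
#include <assert.h>

/* Hypothetical stand-in for a backend's handle bookkeeping. */
typedef struct DemoBackend
{
	int			idle_count;		/* handles on the idle list */
	int			handed_out;		/* 0 or 1, mirrors ->handed_out_io */
} DemoBackend;

/*
 * Returns 1 on success, 0 if no idle handle is available (the caller
 * would then wait for in-flight IO to complete and retry).
 */
static int
demo_acquire(DemoBackend *be)
{
	/* only one handle may be handed out at a time */
	assert(!be->handed_out);

	if (be->idle_count == 0)
		return 0;
	be->idle_count--;
	be->handed_out = 1;
	return 1;
}

/* releasing (or defining) the handle clears the handed-out slot */
static void
demo_release(DemoBackend *be)
{
	assert(be->handed_out);
	be->handed_out = 0;
	be->idle_count++;
}
```

Because at most one handle is ever handed out without being defined or released, waiting for all in-flight IO is always sufficient to free a handle.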
+
+
+typedef struct PgAioCtl
+{
+ int backend_state_count;
+ PgAioBackend *backend_state;
+
+ /*
+ * Array of iovec structs. Each iovec is owned by a specific backend. The
+ * allocation is in PgAioCtl to allow the maximum number of iovecs for
+ * individual IOs to be configurable with a PGC_POSTMASTER GUC.
+ */
+ uint64 iovec_count;
+ struct iovec *iovecs;
+
+ /*
+ * For, e.g., an IO covering multiple buffers in shared / temp buffers, we
+ * need to get Buffer IDs during completion to be able to change the
+ * BufferDesc state accordingly. This space can be used to store e.g.
+ * Buffer IDs. Note that the actual iovec might be shorter than this,
+ * because we combine neighboring pages into one larger iovec entry.
+ */
+ uint64 *handle_data;
+
+ uint64 io_handle_count;
+ PgAioHandle *io_handles;
+} PgAioCtl;
+
+
+
+/*
+ * The set of callbacks that each IO method must implement.
+ *
+ * AFIXME: Document these.
+ */
+typedef struct IoMethodOps
+{
+ /* global initialization */
+ size_t (*shmem_size) (void);
+ void (*shmem_init) (bool first_time);
+
+ /* per-backend initialization */
+ void (*init_backend) (void);
+
+ /* handling of IOs */
+ bool (*needs_synchronous_execution) (PgAioHandle *ioh);
+ int (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+ void (*wait_one) (PgAioHandle *ioh,
+ uint64 ref_generation);
+} IoMethodOps;
+
+
+/* aio.c */
+extern bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state);
+extern void pgaio_io_stage(PgAioHandle *ioh, PgAioOp op);
+extern void pgaio_io_process_completion(PgAioHandle *ioh, int result);
+extern void pgaio_io_prepare_submit(PgAioHandle *ioh);
+extern bool pgaio_io_needs_synchronous_execution(PgAioHandle *ioh);
+extern const char *pgaio_io_get_state_name(PgAioHandle *ioh);
+extern void pgaio_shutdown(int code, Datum arg);
+
+/* aio_callback.c */
+extern void pgaio_io_call_stage(PgAioHandle *ioh);
+extern void pgaio_io_call_complete_shared(PgAioHandle *ioh);
+extern void pgaio_io_call_complete_local(PgAioHandle *ioh);
+
+/* aio_io.c */
+extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
+extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+
+/* aio_target.c */
+extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
+extern void pgaio_io_reopen(PgAioHandle *ioh);
+extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
+
+
+/*
+ * The AIO subsystem has fairly verbose debug logging support. This can be
+ * enabled/disabled at build time. The reason for this is that
+ * a) the verbosity can make debugging things on higher levels hard
+ * b) even if logging can be skipped due to elevel checks, it still causes a
+ * measurable slowdown
+ */
+#define PGAIO_VERBOSE 1
+
+/*
+ * Simple ereport() wrapper that only logs if PGAIO_VERBOSE is defined.
+ *
+ * This intentionally still compiles the code, guarded by a constant if (0),
+ * if verbose logging is disabled, to make it less likely that debug logging
+ * is silently broken.
+ *
+ * The current definition requires passing at least one argument.
+ */
+#define pgaio_debug(elevel, msg, ...) \
+ do { \
+ if (PGAIO_VERBOSE) \
+ ereport(elevel, \
+ errhidestmt(true), errhidecontext(true), \
+ errmsg_internal(msg, \
+ __VA_ARGS__)); \
+ } while(0)
+
+/*
+ * Simple ereport() wrapper. Note that the definition requires passing at
+ * least one argument.
+ */
+#define pgaio_debug_io(elevel, ioh, msg, ...) \
+ pgaio_debug(elevel, "io %-10d|op %-5s|target %-4s|state %-16s: " msg, \
+ pgaio_io_get_id(ioh), \
+ pgaio_io_get_op_name(ioh), \
+ pgaio_io_get_target_name(ioh), \
+ pgaio_io_get_state_name(ioh), \
+ __VA_ARGS__)
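
The "compile it, but let a constant condition elide it" trick used by pgaio_debug() can be demonstrated in isolation. A minimal sketch with hypothetical names:

```c
#include <assert.h>

#define DEMO_VERBOSE 0

static int	demo_log_calls = 0;

static void
demo_log(const char *msg)
{
	(void) msg;
	demo_log_calls++;
}

/*
 * As with pgaio_debug(): the call is always compiled, so a typoed symbol
 * or argument breaks the build even when logging is disabled, but the
 * constant condition lets the compiler drop the call entirely at -O1+.
 */
#define demo_debug(msg) \
	do { \
		if (DEMO_VERBOSE) \
			demo_log(msg); \
	} while (0)
```

With `DEMO_VERBOSE` set to 0 the macro body is dead code, yet it must still name real symbols, which is exactly what keeps disabled debug logging from silently rotting.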
+
+
+/* Declarations for the tables of function pointers exposed by each IO method. */
+extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
+
+extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
+extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
+extern PGDLLIMPORT PgAioBackend *pgaio_my_backend;
+
+
+
+#endif /* AIO_INTERNAL_H */
diff --git a/src/include/storage/aio_types.h b/src/include/storage/aio_types.h
new file mode 100644
index 00000000000..d2617139a25
--- /dev/null
+++ b/src/include/storage/aio_types.h
@@ -0,0 +1,115 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_types.h
+ * AIO related types that are useful to include separately, to reduce the
+ * "include burden".
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/aio_types.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef AIO_TYPES_H
+#define AIO_TYPES_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+
+typedef struct PgAioHandle PgAioHandle;
+
+/*
+ * A reference to an IO that can be used to wait for the IO (using
+ * pgaio_wref_wait()) to complete.
+ *
+ * These can be passed across process boundaries.
+ */
+typedef struct PgAioWaitRef
+{
+ /* internal ID identifying the specific PgAioHandle */
+ uint32 aio_index;
+
+ /*
+ * IO handles are reused. To detect if a handle was reused, and thereby
+ * avoid unnecessarily waiting for a newer IO, each time the handle is
+ * reused a generation number is increased.
+ *
+ * To avoid requiring alignment sufficient for an int64, split the
+ * generation into two.
+ */
+ uint32 generation_upper;
+ uint32 generation_lower;
+} PgAioWaitRef;
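
The split of the 64-bit generation into two uint32 halves, done to avoid imposing 8-byte alignment on the struct, can be sketched standalone (hypothetical names, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the wait-ref layout: the 64-bit generation is
 * stored as two uint32 halves so the struct only needs 4-byte alignment.
 */
typedef struct DemoWaitRef
{
	uint32_t	aio_index;
	uint32_t	generation_upper;
	uint32_t	generation_lower;
} DemoWaitRef;

static void
demo_wref_set(DemoWaitRef *iow, uint32_t index, uint64_t generation)
{
	iow->aio_index = index;
	iow->generation_upper = (uint32_t) (generation >> 32);
	iow->generation_lower = (uint32_t) generation;
}

static uint64_t
demo_wref_generation(const DemoWaitRef *iow)
{
	return ((uint64_t) iow->generation_upper << 32) | iow->generation_lower;
}

/* a handle was recycled iff its current generation moved past the ref's */
static int
demo_wref_is_stale(const DemoWaitRef *iow, uint64_t current_generation)
{
	return current_generation != demo_wref_generation(iow);
}
```

A waiter reassembles the generation and compares it against the handle's current one; any mismatch means the handle was reused and the wait can be skipped.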
+
+
+/*
+ * Information identifying what the IO is being performed on.
+ *
+ * This needs sufficient information to
+ *
+ * a) Reopen the file for the IO if the IO is executed in a context that
+ * cannot use the FD provided initially (e.g. because the IO is executed in
+ * a worker process).
+ *
+ * b) Describe the object the IO is performed on in log / error messages.
+ */
+typedef union PgAioTargetData
+{
+ /* just as an example placeholder for later */
+ struct
+ {
+ uint32 queue_id;
+ } wal;
+} PgAioTargetData;
+
+
+/*
+ * The status of an AIO operation.
+ */
+typedef enum PgAioResultStatus
+{
+ ARS_UNKNOWN, /* not yet completed / uninitialized */
+ ARS_OK,
+ ARS_PARTIAL, /* did not fully succeed, but no error */
+ ARS_ERROR,
+} PgAioResultStatus;
+
+
+/*
+ * Result of IO operation, visible only to the initiator of IO.
+ */
+typedef struct PgAioResult
+{
+ /*
+ * This is of type PgAioHandleCallbackID, but can't use a bitfield of an
+ * enum, because some compilers treat enums as signed.
+ */
+ uint32 id:8;
+
+ /* of type PgAioResultStatus, see above */
+ uint32 status:2;
+
+ /* meaning defined by callback->error */
+ uint32 error_data:22;
+
+ int32 result;
+} PgAioResult;
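
The 8 + 2 + 22 bit packing above fits exactly into one uint32, keeping the whole result at 8 bytes. A self-contained mirror of the layout (hypothetical names; field widths taken from the struct above):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the packed result. Plain uint32 bitfields are
 * used instead of enum bitfields, since some compilers treat enum
 * bitfields as signed and would corrupt the stored values.
 */
typedef struct DemoResult
{
	uint32_t	id:8;			/* callback ID, up to 255 */
	uint32_t	status:2;		/* one of four status values */
	uint32_t	error_data:22;	/* callback-defined, up to ~4M */
	int32_t		result;			/* raw (e.g. syscall) result */
} DemoResult;
```

The widths bound what each field can carry; values within those bounds round-trip unchanged, and on common ABIs the three bitfields share a single 32-bit unit.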
+
+
+/*
+ * Combination of PgAioResult with minimal metadata about the IO.
+ *
+ * Contains sufficient information to be able, in case the IO [partially]
+ * fails, to log/raise an error under control of the IO issuing code.
+ */
+typedef struct PgAioReturn
+{
+ PgAioResult result;
+ PgAioTargetData target_data;
+} PgAioReturn;
+
+
+#endif /* AIO_TYPES_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index e8d452ca7ee..aede4bfc820 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -164,4 +164,9 @@ struct LOCALLOCK;
extern void ResourceOwnerRememberLock(ResourceOwner owner, struct LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, struct LOCALLOCK *locallock);
+/* special support for AIO */
+struct dlist_node;
+extern void ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+extern void ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node);
+
#endif /* RESOWNER_H */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d78..a252c3a81b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -51,6 +51,7 @@
#include "replication/origin.h"
#include "replication/snapbuild.h"
#include "replication/syncrep.h"
+#include "storage/aio.h"
#include "storage/condition_variable.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -2475,6 +2476,8 @@ CommitTransaction(void)
AtEOXact_LogicalRepWorkers(true);
pgstat_report_xact_timestamp(0);
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ true);
+
ResourceOwnerDelete(TopTransactionResourceOwner);
s->curTransactionOwner = NULL;
CurTransactionResourceOwner = NULL;
@@ -2988,6 +2991,8 @@ AbortTransaction(void)
pgstat_report_xact_timestamp(0);
}
+ pgaio_at_xact_end( /* is_subxact = */ false, /* is_commit = */ false);
+
/*
* State remains TRANS_ABORT until CleanupTransaction().
*/
@@ -5185,6 +5190,8 @@ CommitSubTransaction(void)
AtEOSubXact_PgStat(true, s->nestingLevel);
AtSubCommit_Snapshot(s->nestingLevel);
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ true);
+
/*
* We need to restore the upper transaction's read-only state, in case the
* upper is read-write while the child is read-only; GUC will incorrectly
@@ -5351,6 +5358,8 @@ AbortSubTransaction(void)
AtSubAbort_Snapshot(s->nestingLevel);
}
+ pgaio_at_xact_end( /* is_subxact = */ true, /* is_commit = */ false);
+
/*
* Restore the upper transaction's read-only state, too. This should be
* redundant with GUC's cleanup but we may as well do it for consistency
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index eaeaeeee8e3..89f821ea7e1 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -10,7 +10,11 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
+ aio_callback.o \
aio_init.o \
+ aio_io.o \
+ aio_target.o \
+ method_sync.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index f68cbc2b3f4..cefa888884c 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -3,6 +3,28 @@
* aio.c
* AIO - Core Logic
*
+ * For documentation about how AIO works on a higher level, including a
+ * schematic example, see README.md.
+ *
+ *
+ * AIO is a complicated subsystem. To keep things navigable it is split across
+ * a number of files:
+ *
+ * - method_*.c - different ways of executing AIO (e.g. worker process)
+ *
+ * - aio_target.c - IO on different kinds of targets
+ *
+ * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
+ *
+ * - aio_callback.c - callbacks at IO operation lifecycle events
+ *
+ * - aio_init.c - per-server and per-backend initialization
+ *
+ * - aio.c - all other topics
+ *
+ * - read_stream.c - helper for reading buffered relation data
+ *
+ *
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
@@ -14,8 +36,22 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "utils/guc.h"
+#include "utils/resowner.h"
+#include "utils/wait_event_types.h"
+
+
+static inline void pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state);
+static void pgaio_io_reclaim(PgAioHandle *ioh);
+static void pgaio_io_resowner_register(PgAioHandle *ioh);
+static void pgaio_io_wait_for_free(void);
+static PgAioHandle *pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation);
+static const char *pgaio_io_state_get_name(PgAioHandleState s);
+static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
/* Options for io_method. */
@@ -28,9 +64,877 @@ const struct config_enum_entry io_method_options[] = {
int io_method = DEFAULT_IO_METHOD;
int io_max_concurrency = -1;
+/* global control for AIO */
+PgAioCtl *pgaio_ctl;
+/* current backend's per-backend state */
+PgAioBackend *pgaio_my_backend;
+
+
+static const IoMethodOps *const pgaio_method_ops_table[] = {
+ [IOMETHOD_SYNC] = &pgaio_sync_ops,
+};
+
+/* callbacks for the configured io_method, set by assign_io_method */
+const IoMethodOps *pgaio_method_ops;
+
+
+
+/* --------------------------------------------------------------------------------
+ * Public Functions related to PgAioHandle
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Acquire an AioHandle, waiting for IO completion if necessary.
+ *
+ * Each backend can only have one AIO handle that has been "handed out" to
+ * code, but not yet submitted or released. This restriction is necessary
+ * to ensure that it is possible for code to wait for an unused handle by
+ * waiting for in-flight IO to complete. There is a limited number of handles
+ * in each backend; if multiple handles could be handed out without being
+ * submitted, waiting for all in-flight IO to complete would not guarantee
+ * that handles free up.
+ *
+ * It is cheap to acquire an IO handle, unless all handles are in use. In that
+ * case this function waits for the oldest IO to complete. In case that is not
+ * desirable, see pgaio_io_acquire_nb().
+ *
+ * If a handle was acquired but then does not turn out to be needed,
+ * e.g. because pgaio_io_acquire() is called before starting an IO in a
+ * critical section, the handle needs to be released with pgaio_io_release().
+ *
+ *
+ * To react to the completion of the IO as soon as it is known to have
+ * completed, callbacks can be registered with pgaio_io_register_callbacks().
+ *
+ * To actually execute IO using the returned handle, the pgaio_io_prep_*()
+ * family of functions is used. In many cases the pgaio_io_prep_*() call will
+ * not be done directly by code that acquired the handle, but by lower level
+ * code that gets passed the handle. E.g. if code in bufmgr.c wants to perform
+ * AIO, it typically will pass the handle to smgr.c, which will pass it on to
+ * md.c, on to fd.c, which then finally calls pgaio_io_prep_*(). This
+ * forwarding allows the various layers to react to the IO's completion by
+ * registering callbacks. These callbacks in turn can translate a lower
+ * layer's result into a result understandable by a higher layer.
+ *
+ * Once pgaio_io_prep_*() is called, the IO may be in the process of being
+ * executed and might even complete before the functions return. That is,
+ * however, not guaranteed, to allow IO submission to be batched. To guarantee
+ * IO submission pgaio_submit_staged() needs to be called.
+ *
+ * After pgaio_io_prep_*() the AioHandle is "consumed" and may not be
+ * referenced by the IO issuing code. To e.g. wait for IO, references to the
+ * IO can be established with pgaio_io_get_wref() *before* pgaio_io_prep_*()
+ * is called. pgaio_wref_wait() can be used to wait for the IO to complete.
+ *
+ *
+ * To know if the IO [partially] succeeded or failed, a PgAioReturn * can be
+ * passed to pgaio_io_acquire(). Once the issuing backend has called
+ * pgaio_wref_wait(), the PgAioReturn contains information about whether the
+ * operation succeeded and details about the first failure, if any. The error
+ * can be raised / logged with pgaio_result_report().
+ *
+ * The lifetime of the memory pointed to by *ret needs to be at least as long
+ * as the passed in resowner. If the resowner releases resources before the IO
+ * completes (typically due to an error), the reference to *ret will be
+ * cleared. In case of resowner cleanup *ret will not be updated with the
+ * results of the IO operation.
+ */
+PgAioHandle *
+pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ PgAioHandle *h;
+
+ while (true)
+ {
+ h = pgaio_io_acquire_nb(resowner, ret);
+
+ if (h != NULL)
+ return h;
+
+ /*
+ * Evidently all handles by this backend are in use. Just wait for
+ * some to complete.
+ */
+ pgaio_io_wait_for_free();
+ }
+}
+
+/*
+ * Acquire an AioHandle, returning NULL if no handles are free.
+ *
+ * See pgaio_io_acquire(). The only difference is that this function will return
+ * NULL if there are no idle handles, instead of blocking.
+ */
+PgAioHandle *
+pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
+{
+ if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
+ {
+ Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
+ pgaio_submit_staged();
+ }
+
+ if (pgaio_my_backend->handed_out_io)
+ {
+ ereport(ERROR,
+ errmsg("API violation: Only one IO can be handed out"));
+ }
+
+ if (!dclist_is_empty(&pgaio_my_backend->idle_ios))
+ {
+ dlist_node *ion = dclist_pop_head_node(&pgaio_my_backend->idle_ios);
+ PgAioHandle *ioh = dclist_container(PgAioHandle, node, ion);
+
+ Assert(ioh->state == PGAIO_HS_IDLE);
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_HANDED_OUT);
+ pgaio_my_backend->handed_out_io = ioh;
+
+ if (resowner)
+ pgaio_io_resowner_register(ioh);
+
+ if (ret)
+ {
+ ioh->report_return = ret;
+ ret->result.status = ARS_UNKNOWN;
+ }
+
+ return ioh;
+ }
+
+ return NULL;
+}
+
+/*
+ * Release IO handle that turned out to not be required.
+ *
+ * See pgaio_io_acquire() for more details.
+ */
+void
+pgaio_io_release(PgAioHandle *ioh)
+{
+ if (ioh == pgaio_my_backend->handed_out_io)
+ {
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->resowner);
+
+ pgaio_my_backend->handed_out_io = NULL;
+ pgaio_io_reclaim(ioh);
+ }
+ else
+ {
+ elog(ERROR, "release in unexpected state");
+ }
+}
+
+/*
+ * Release IO handle during resource owner cleanup.
+ */
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+ PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+ Assert(ioh->resowner);
+
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+
+ switch (ioh->state)
+ {
+ case PGAIO_HS_IDLE:
+ elog(ERROR, "unexpected");
+ break;
+ case PGAIO_HS_HANDED_OUT:
+ Assert(ioh == pgaio_my_backend->handed_out_io || pgaio_my_backend->handed_out_io == NULL);
+
+ if (ioh == pgaio_my_backend->handed_out_io)
+ {
+ pgaio_my_backend->handed_out_io = NULL;
+ if (!on_error)
+ elog(WARNING, "leaked AIO handle");
+ }
+
+ pgaio_io_reclaim(ioh);
+ break;
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+ /* XXX: Should we warn about this when is_commit? */
+ pgaio_submit_staged();
+ break;
+ case PGAIO_HS_SUBMITTED:
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* this is expected to happen */
+ break;
+ }
+
+ /*
+ * Need to unregister the reporting of the IO's result; the memory it's
+ * referencing has likely gone away.
+ */
+ if (ioh->report_return)
+ ioh->report_return = NULL;
+}
+
+/*
+ * Add a [set of] flags to the IO.
+ *
+ * Note that this combines the flags with any already-set flags, rather than
+ * overwriting them with exactly the passed-in value. This is to allow
+ * multiple callsites to set flags.
+ */
+void
+pgaio_io_set_flag(PgAioHandle *ioh, PgAioHandleFlags flag)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+
+ ioh->flags |= flag;
+}
+
+int
+pgaio_io_get_id(PgAioHandle *ioh)
+{
+ Assert(ioh >= pgaio_ctl->io_handles &&
+ ioh < (pgaio_ctl->io_handles + pgaio_ctl->io_handle_count));
+ return ioh - pgaio_ctl->io_handles;
+}
+
+ProcNumber
+pgaio_io_get_owner(PgAioHandle *ioh)
+{
+ return ioh->owner_procno;
+}
+
+void
+pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT ||
+ ioh->state == PGAIO_HS_DEFINED ||
+ ioh->state == PGAIO_HS_STAGED);
+ Assert(ioh->generation != 0);
+
+ iow->aio_index = ioh - pgaio_ctl->io_handles;
+ iow->generation_upper = (uint32) (ioh->generation >> 32);
+ iow->generation_lower = (uint32) ioh->generation;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Internal Functions related to PgAioHandle
+ * --------------------------------------------------------------------------------
+ */
+
+static inline void
+pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
+{
+ pgaio_debug_io(DEBUG4, ioh,
+ "updating state to %s",
+ pgaio_io_state_get_name(new_state));
+
+ /*
+ * Ensure the changes signified by the new state are visible before the
+ * new state becomes visible.
+ */
+ pg_write_barrier();
+
+ ioh->state = new_state;
+}
+
+static void
+pgaio_io_resowner_register(PgAioHandle *ioh)
+{
+ Assert(!ioh->resowner);
+ Assert(CurrentResourceOwner);
+
+ ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
+ ioh->resowner = CurrentResourceOwner;
+}
+
+/*
+ * Should only be called from pgaio_io_prep_*().
+ */
+void
+pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
+{
+ bool needs_synchronous;
+
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(pgaio_io_has_target(ioh));
+
+ ioh->op = op;
+ ioh->result = 0;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
+
+ /* allow a new IO to be staged */
+ pgaio_my_backend->handed_out_io = NULL;
+
+ pgaio_io_call_stage(ioh);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_STAGED);
+
+ needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "prepared, executing synchronously: %d",
+ needs_synchronous);
+
+ if (!needs_synchronous)
+ {
+ pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
+ Assert(pgaio_my_backend->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+ }
+ else
+ {
+ pgaio_io_prepare_submit(ioh);
+ pgaio_io_perform_synchronously(ioh);
+ }
+}
+
+bool
+pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ if (ioh->flags & PGAIO_HF_SYNCHRONOUS)
+ {
+ /* XXX: should we also check if there are other IOs staged? */
+ return true;
+ }
+
+ if (pgaio_method_ops->needs_synchronous_execution)
+ return pgaio_method_ops->needs_synchronous_execution(ioh);
+ return false;
+}
+
+/*
+ * Handle IO being processed by IO method.
+ *
+ * Should be called by IO methods / synchronous IO execution, just before the
+ * IO is performed.
+ */
+void
+pgaio_io_prepare_submit(PgAioHandle *ioh)
+{
+ pgaio_io_update_state(ioh, PGAIO_HS_SUBMITTED);
+
+ dclist_push_tail(&pgaio_my_backend->in_flight_ios, &ioh->node);
+}
+
+/*
+ * Handle IO getting completed by a method.
+ *
+ * Should be called by IO methods / synchronous IO execution
+ */
+void
+pgaio_io_process_completion(PgAioHandle *ioh, int result)
+{
+ Assert(ioh->state == PGAIO_HS_SUBMITTED);
+
+ ioh->result = result;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
+
+ pgaio_io_call_complete_shared(ioh);
+
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
+
+ /* condition variable broadcast ensures state is visible before wakeup */
+ ConditionVariableBroadcast(&ioh->cv);
+
+ if (ioh->owner_procno == MyProcNumber)
+ pgaio_io_reclaim(ioh);
+}
+
+bool
+pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
+{
+ *state = ioh->state;
+ pg_read_barrier();
+
+ return ioh->generation != ref_generation;
+}
+
+/*
+ * Wait for IO to complete. External code should never use this; outside of
+ * the AIO subsystem, waits are only allowed via pgaio_wref_wait().
+ */
+static void
+pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ bool am_owner;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ if (am_owner)
+ {
+ if (state == PGAIO_HS_STAGED)
+ {
+ /* XXX: Arguably this should be prevented by callers? */
+ pgaio_submit_staged();
+ }
+ else if (state != PGAIO_HS_SUBMITTED
+ && state != PGAIO_HS_COMPLETED_IO
+ && state != PGAIO_HS_COMPLETED_SHARED
+ && state != PGAIO_HS_COMPLETED_LOCAL)
+ {
+ elog(PANIC, "waiting for own IO in wrong state: %d",
+ state);
+ }
+ }
+
+ while (true)
+ {
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return;
+
+ switch (state)
+ {
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_HANDED_OUT:
+ elog(ERROR, "IO in wrong state: %d", state);
+ break;
+
+ case PGAIO_HS_SUBMITTED:
+
+ /*
+ * If we need to wait via the IO method, do so now. Don't
+ * check via the IO method if the issuing backend is executing
+ * the IO synchronously.
+ */
+ if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
+ {
+ pgaio_method_ops->wait_one(ioh, ref_generation);
+ continue;
+ }
+ /* fallthrough */
+
+ /* waiting for owner to submit */
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_STAGED:
+ /* waiting for reaper to complete */
+ /* fallthrough */
+ case PGAIO_HS_COMPLETED_IO:
+ /* shouldn't be able to hit this otherwise */
+ Assert(IsUnderPostmaster);
+ /* ensure we're going to get woken up */
+ ConditionVariablePrepareToSleep(&ioh->cv);
+
+ while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
+ {
+ if (state == PGAIO_HS_COMPLETED_SHARED ||
+ state == PGAIO_HS_COMPLETED_LOCAL)
+ break;
+ ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_COMPLETION);
+ }
+
+ ConditionVariableCancelSleep();
+ break;
+
+ case PGAIO_HS_COMPLETED_SHARED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ /* see above */
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return;
+ }
+ }
+}
+
+static void
+pgaio_io_reclaim(PgAioHandle *ioh)
+{
+ /* This is only ok if it's our IO */
+ Assert(ioh->owner_procno == MyProcNumber);
+
+ pgaio_debug_io(DEBUG4, ioh,
+ "reclaiming, result: %d, distilled_result: AFIXME, report to: %p",
+ ioh->result,
+ ioh->report_return);
+
+ if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
+ {
+ pgaio_io_call_complete_local(ioh);
+ pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_LOCAL);
+ }
+
+ /* if the IO has been defined, we might need to do more work */
+ if (ioh->state != PGAIO_HS_HANDED_OUT)
+ {
+ dclist_delete_from(&pgaio_my_backend->in_flight_ios, &ioh->node);
+
+ if (ioh->report_return)
+ {
+ ioh->report_return->result = ioh->distilled_result;
+ ioh->report_return->target_data = ioh->target_data;
+ }
+ }
+
+ if (ioh->resowner)
+ {
+ ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+ ioh->resowner = NULL;
+ }
+
+ Assert(!ioh->resowner);
+
+ ioh->op = PGAIO_OP_INVALID;
+ ioh->target = PGAIO_TID_INVALID;
+ ioh->flags = 0;
+ ioh->num_shared_callbacks = 0;
+ ioh->handle_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->result = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+
+ /* XXX: the barrier is probably superfluous */
+ pg_write_barrier();
+ ioh->generation++;
+
+ pgaio_io_update_state(ioh, PGAIO_HS_IDLE);
+
+ /*
+ * We push the IO to the head of the idle IO list; that seems more
+ * cache-efficient in cases where only a few IOs are used.
+ */
+ dclist_push_head(&pgaio_my_backend->idle_ios, &ioh->node);
+}
+
+static void
+pgaio_io_wait_for_free(void)
+{
+ int reclaimed = 0;
+
+ pgaio_debug(DEBUG2, "waiting for self with %d pending",
+ pgaio_my_backend->num_staged_ios);
+
+ /*
+ * First check if any of our IOs have actually completed - when using the
+ * worker method, that'll often be the case. We could do so as part of the
+ * loop below, but then we might end up waiting for an IO submitted
+ * earlier even though another one has already completed.
+ */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[pgaio_my_backend->io_handle_off + i];
+
+ if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
+ {
+ pgaio_io_reclaim(ioh);
+ reclaimed++;
+ }
+ }
+
+ if (reclaimed > 0)
+ return;
+
+ /*
+ * If we have any unsubmitted IOs, submit them now. We'll start waiting in
+ * a moment, so it's better if they're in flight. This also addresses the
+ * edge case that all IOs are unsubmitted.
+ */
+ if (pgaio_my_backend->num_staged_ios > 0)
+ {
+ pgaio_submit_staged();
+ }
+
+ /*
+ * By now there must be at least one IO in flight; otherwise there'd be
+ * nothing to wait for and no way to ever free up an IO.
+ */
+ if (dclist_count(&pgaio_my_backend->in_flight_ios) == 0)
+ {
+ elog(ERROR, "no free IOs despite no in-flight IOs");
+ }
+
+ /*
+ * Wait for the oldest in-flight IO to complete.
+ *
+ * XXX: Reusing the general IO wait is suboptimal, we don't need to wait
+ * for that specific IO to complete, we just need *any* IO to complete.
+ */
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &pgaio_my_backend->in_flight_ios);
+
+ switch (ioh->state)
+ {
+ /* should not be in in-flight list */
+ case PGAIO_HS_IDLE:
+ case PGAIO_HS_DEFINED:
+ case PGAIO_HS_HANDED_OUT:
+ case PGAIO_HS_STAGED:
+ case PGAIO_HS_COMPLETED_LOCAL:
+ elog(ERROR, "shouldn't get here with io:%d in state %d",
+ pgaio_io_get_id(ioh), ioh->state);
+ break;
+
+ case PGAIO_HS_COMPLETED_IO:
+ case PGAIO_HS_SUBMITTED:
+ pgaio_debug_io(DEBUG2, ioh,
+ "waiting for free io with %d in flight",
+ dclist_count(&pgaio_my_backend->in_flight_ios));
+
+ /*
+ * In a more general case this would be racy, because the
+ * generation could increase after we read ioh->state above.
+ * But we are only looking at IOs by the current backend and
+ * the IO can only be recycled by this backend.
+ */
+ pgaio_io_wait(ioh, ioh->generation);
+ break;
+
+ case PGAIO_HS_COMPLETED_SHARED:
+ /* it's possible that another backend just finished this IO */
+ pgaio_io_reclaim(ioh);
+ break;
+ }
+
+ if (dclist_count(&pgaio_my_backend->idle_ios) == 0)
+ elog(PANIC, "no idle IOs after waiting");
+ return;
+ }
+}
+
+/*
+ * Internal - code outside of AIO should never need this, and it'd be hard
+ * for such code to use it safely.
+ */
+static PgAioHandle *
+pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation)
+{
+ PgAioHandle *ioh;
+
+ Assert(iow->aio_index < pgaio_ctl->io_handle_count);
+
+ ioh = &pgaio_ctl->io_handles[iow->aio_index];
+
+ *ref_generation = ((uint64) iow->generation_upper) << 32 |
+ iow->generation_lower;
+
+ Assert(*ref_generation != 0);
+
+ return ioh;
+}
+
+static const char *
+pgaio_io_state_get_name(PgAioHandleState s)
+{
+#define PGAIO_HS_TOSTR_CASE(sym) case PGAIO_HS_##sym: return #sym
+ switch (s)
+ {
+ PGAIO_HS_TOSTR_CASE(IDLE);
+ PGAIO_HS_TOSTR_CASE(HANDED_OUT);
+ PGAIO_HS_TOSTR_CASE(DEFINED);
+ PGAIO_HS_TOSTR_CASE(STAGED);
+ PGAIO_HS_TOSTR_CASE(SUBMITTED);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_IO);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_SHARED);
+ PGAIO_HS_TOSTR_CASE(COMPLETED_LOCAL);
+ }
+#undef PGAIO_HS_TOSTR_CASE
+
+ return NULL; /* silence compiler */
+}
+
+const char *
+pgaio_io_get_state_name(PgAioHandle *ioh)
+{
+ return pgaio_io_state_get_name(ioh->state);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Functions primarily related to IO Wait References
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_wref_clear(PgAioWaitRef *iow)
+{
+ iow->aio_index = PG_UINT32_MAX;
+}
+
+bool
+pgaio_wref_valid(PgAioWaitRef *iow)
+{
+ return iow->aio_index != PG_UINT32_MAX;
+}
+
+int
+pgaio_wref_get_id(PgAioWaitRef *iow)
+{
+ Assert(pgaio_wref_valid(iow));
+ return iow->aio_index;
+}
+
+/*
+ * Wait for the IO to have completed.
+ */
+void
+pgaio_wref_wait(PgAioWaitRef *iow)
+{
+ uint64 ref_generation;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+ pgaio_io_wait(ioh, ref_generation);
+}
+
+/*
+ * Check if the referenced IO completed, without blocking.
+ */
+bool
+pgaio_wref_check_done(PgAioWaitRef *iow)
+{
+ uint64 ref_generation;
+ PgAioHandleState state;
+ bool am_owner;
+ PgAioHandle *ioh;
+
+ ioh = pgaio_io_from_wref(iow, &ref_generation);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+ return true;
+
+ if (state == PGAIO_HS_IDLE)
+ return true;
+
+ am_owner = ioh->owner_procno == MyProcNumber;
+
+ if (state == PGAIO_HS_COMPLETED_SHARED ||
+ state == PGAIO_HS_COMPLETED_LOCAL)
+ {
+ if (am_owner)
+ pgaio_io_reclaim(ioh);
+ return true;
+ }
+
+ return false;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Actions on multiple IOs.
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_submit_staged(void)
+{
+ int total_submitted = 0;
+ int did_submit;
+
+ if (pgaio_my_backend->num_staged_ios == 0)
+ return;
+
+
+ START_CRIT_SECTION();
+
+ did_submit = pgaio_method_ops->submit(pgaio_my_backend->num_staged_ios,
+ pgaio_my_backend->staged_ios);
+
+ END_CRIT_SECTION();
+
+ total_submitted += did_submit;
+
+ Assert(total_submitted == did_submit);
+
+ pgaio_my_backend->num_staged_ios = 0;
+
+ pgaio_debug(DEBUG4,
+ "aio: submitted %d IOs",
+ total_submitted);
+}
+
+bool
+pgaio_have_staged(void)
+{
+ return pgaio_my_backend->num_staged_ios > 0;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Other
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Need to submit IOs that are staged but not yet submitted and that use the
+ * fd; otherwise the IO would end up targeting something bogus.
+ */
+void
+pgaio_closing_fd(int fd)
+{
+ /*
+ * Might be called before AIO is initialized or in a subprocess that
+ * doesn't use AIO.
+ */
+ if (!pgaio_my_backend)
+ return;
+
+ /*
+ * For now just submit all staged IOs - we could be more selective, but
+ * it's probably not worth it.
+ */
+ pgaio_submit_staged();
+}
+
+void
+pgaio_at_xact_end(bool is_subxact, bool is_commit)
+{
+ Assert(!pgaio_my_backend->handed_out_io);
+}
+
+/*
+ * Similar to pgaio_at_xact_end(..., is_commit = false), but for cases where
+ * errors happen outside of transactions.
+ */
+void
+pgaio_at_error(void)
+{
+ Assert(!pgaio_my_backend->handed_out_io);
+}
+
+void
+pgaio_shutdown(int code, Datum arg)
+{
+ Assert(pgaio_my_backend);
+ Assert(!pgaio_my_backend->handed_out_io);
+
+ /*
+ * Before exiting, make sure that all IOs are finished. That has two main
+ * purposes:
+ *
+ * - It's somewhat annoying to see partially finished IOs in stats views
+ * etc.
+ *
+ * - It's rumored that some kernel-level AIO mechanisms don't deal well
+ * with the issuer of an AIO exiting before the IO completes.
+ */
+
+ while (!dclist_is_empty(&pgaio_my_backend->in_flight_ios))
+ {
+ PgAioHandle *ioh = dclist_head_element(PgAioHandle, node, &pgaio_my_backend->in_flight_ios);
+
+ /* see comment in pgaio_io_wait_for_free() about raciness */
+ pgaio_io_wait(ioh, ioh->generation);
+ }
+
+ pgaio_my_backend = NULL;
+}
void
assign_io_method(int newval, void *extra)
{
+ Assert(newval < lengthof(io_method_options));
+ Assert(pgaio_method_ops_table[newval] != NULL);
+
+ pgaio_method_ops = pgaio_method_ops_table[newval];
}
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
new file mode 100644
index 00000000000..93f71690169
--- /dev/null
+++ b/src/backend/storage/aio/aio_callback.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_callback.c
+ * AIO - Functionality related to callbacks that can be registered on IO
+ * Handles
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_callback.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "utils/memutils.h"
+
+
+/* just to have something to put into the aio_handle_cbs */
+static const struct PgAioHandleCallbacks aio_invalid_cb = {0};
+
+typedef struct PgAioHandleCallbacksEntry
+{
+ const PgAioHandleCallbacks *const cb;
+ const char *const name;
+} PgAioHandleCallbacksEntry;
+
+/*
+ * Callback definition for the callbacks that can be registered on an IO
+ * handle. See PgAioHandleCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
+ */
+static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
+#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
+ CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
+#undef CALLBACK_ENTRY
+};
+
+
+
+/*
+ * Register callback for the IO handle.
+ *
+ * Only a limited number (PGAIO_HANDLE_MAX_CALLBACKS) of callbacks can be
+ * registered for each IO.
+ *
+ * Callbacks need to be registered before [indirectly] calling
+ * pgaio_io_prep_*(), as the IO may be executed immediately.
+ *
+ *
+ * Note that callbacks are executed in critical sections. This is necessary
+ * to be able to execute IO in critical sections (consider e.g. WAL
+ * logging). To perform AIO we first need to acquire a handle, which, if there
+ * are no free handles, requires waiting for IOs to complete and to execute
+ * their completion callbacks.
+ *
+ * Callbacks may be executed in the issuing backend but also in another
+ * backend (because that backend is waiting for the IO) or in IO workers (if
+ * io_method=worker is used).
+ *
+ *
+ * See PgAioHandleCallbackID's definition for an explanation for why
+ * callbacks are not identified by a pointer.
+ */
+void
+pgaio_io_register_callbacks(PgAioHandle *ioh, PgAioHandleCallbackID cbid)
+{
+ const PgAioHandleCallbacksEntry *ce;
+
+ if (cbid >= lengthof(aio_handle_cbs))
+ elog(ERROR, "callback %d is out of range", cbid);
+ ce = &aio_handle_cbs[cbid];
+ if (ce->cb->complete_shared == NULL &&
+ ce->cb->complete_local == NULL)
+ elog(ERROR, "callback %d does not have a completion callback", cbid);
+ if (ioh->num_shared_callbacks >= PGAIO_HANDLE_MAX_CALLBACKS)
+ elog(PANIC, "too many callbacks, the max is %d", PGAIO_HANDLE_MAX_CALLBACKS);
+ ioh->shared_callbacks[ioh->num_shared_callbacks] = cbid;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "adding cb #%d, id %d/%s",
+ ioh->num_shared_callbacks + 1,
+ cbid, ce->name);
+
+ ioh->num_shared_callbacks++;
+}
+
+/*
+ * Associate an array of data with the Handle. This is e.g. useful to
+ * transport knowledge about which buffers a multi-block IO affects to
+ * completion callbacks.
+ *
+ * Right now this can be done only once for each IO, even though multiple
+ * callbacks can be registered. There aren't any known use cases requiring
+ * more, and the required amount of shared memory does add up, so it doesn't
+ * seem worth multiplying memory usage by PGAIO_HANDLE_MAX_CALLBACKS.
+ */
+void
+pgaio_io_set_handle_data_64(PgAioHandle *ioh, uint64 *data, uint8 len)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->handle_data_len == 0);
+ Assert(len <= PG_IOV_MAX);
+
+ for (int i = 0; i < len; i++)
+ pgaio_ctl->handle_data[ioh->iovec_off + i] = data[i];
+ ioh->handle_data_len = len;
+}
+
+/*
+ * Convenience version of pgaio_io_set_handle_data_64() that converts a
+ * 32-bit array to a 64-bit array. Without it, callers would need to
+ * open-code the equivalent conversion.
+ */
+void
+pgaio_io_set_handle_data_32(PgAioHandle *ioh, uint32 *data, uint8 len)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->handle_data_len == 0);
+ Assert(len <= PG_IOV_MAX);
+
+ for (int i = 0; i < len; i++)
+ pgaio_ctl->handle_data[ioh->iovec_off + i] = data[i];
+ ioh->handle_data_len = len;
+}
+
+/*
+ * Return data set with pgaio_io_set_handle_data_*().
+ */
+uint64 *
+pgaio_io_get_handle_data(PgAioHandle *ioh, uint8 *len)
+{
+ Assert(ioh->handle_data_len > 0);
+
+ *len = ioh->handle_data_len;
+
+ return &pgaio_ctl->handle_data[ioh->iovec_off];
+}
+
+/*
+ * Internal function which invokes ->stage for all the registered callbacks.
+ */
+void
+pgaio_io_call_stage(PgAioHandle *ioh)
+{
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->stage)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d %d/%s->stage",
+ i, cbid, ce->name);
+ ce->cb->stage(ioh);
+ }
+}
+
+/*
+ * Internal function which invokes ->complete_shared for all the registered
+ * callbacks.
+ */
+void
+pgaio_io_call_complete_shared(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ START_CRIT_SECTION();
+
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ result.status = ARS_OK; /* low level IO is always considered OK */
+ result.result = ioh->result;
+ result.id = PGAIO_HCB_INVALID;
+ result.error_data = 0;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->complete_shared)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d, id %d/%s->complete_shared with distilled result status %d, id %u, error_data: %d, result: %d",
+ i, cbid, ce->name,
+ result.status, result.id, result.error_data, result.result);
+ result = ce->cb->complete_shared(ioh, result);
+ }
+
+ ioh->distilled_result = result;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ result.status, result.id, result.error_data, result.result,
+ ioh->result);
+
+ END_CRIT_SECTION();
+}
+
+
+/*
+ * Internal function which invokes ->complete_local for all the registered
+ * callbacks.
+ *
+ * XXX: It'd be nice to deduplicate with pgaio_io_call_complete_shared().
+ */
+void
+pgaio_io_call_complete_local(PgAioHandle *ioh)
+{
+ PgAioResult result;
+
+ START_CRIT_SECTION();
+
+ Assert(ioh->target > PGAIO_TID_INVALID && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op > PGAIO_OP_INVALID && ioh->op < PGAIO_OP_COUNT);
+
+ /* start with distilled result from shared callback */
+ result = ioh->distilled_result;
+
+ for (int i = ioh->num_shared_callbacks; i > 0; i--)
+ {
+ PgAioHandleCallbackID cbid = ioh->shared_callbacks[i - 1];
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ if (!ce->cb->complete_local)
+ continue;
+
+ pgaio_debug_io(DEBUG3, ioh,
+ "calling cb #%d, id %d/%s->complete_local with distilled result status %d, id %u, error_data: %d, result: %d",
+ i, cbid, ce->name,
+ result.status, result.id, result.error_data, result.result);
+ result = ce->cb->complete_local(ioh, result);
+ }
+
+ /*
+ * Note that we don't save the result in ioh->distilled_result, the local
+ * callback's result should not ever matter to other waiters.
+ */
+ pgaio_debug_io(DEBUG3, ioh,
+ "distilled result status %d, id %u, error_data: %d, result: %d, raw_result %d",
+ result.status, result.id, result.error_data, result.result,
+ ioh->result);
+
+ END_CRIT_SECTION();
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * IO Result
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_result_report(PgAioResult result, const PgAioTargetData *target_data, int elevel)
+{
+ PgAioHandleCallbackID cbid = result.id;
+ const PgAioHandleCallbacksEntry *ce = &aio_handle_cbs[cbid];
+
+ Assert(result.status != ARS_UNKNOWN);
+ Assert(result.status != ARS_OK);
+
+ if (ce->cb->report == NULL)
+ elog(ERROR, "callback %d/%s does not have report callback",
+ result.id, ce->name);
+
+ ce->cb->report(result, target_data, elevel);
+}
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index f7ee8270756..0e98cc0c8fb 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -14,24 +14,210 @@
#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/aio_init.h"
+#include "storage/aio_internal.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
+static Size
+AioCtlShmemSize(void)
+{
+ Size sz;
+
+ /* pgaio_ctl itself */
+ sz = offsetof(PgAioCtl, io_handles);
+
+ return sz;
+}
+
+static uint32
+AioProcs(void)
+{
+ return MaxBackends + NUM_AUXILIARY_PROCS;
+}
+
+static Size
+AioBackendShmemSize(void)
+{
+ return mul_size(AioProcs(), sizeof(PgAioBackend));
+}
+
+static Size
+AioHandleShmemSize(void)
+{
+ Size sz;
+
+ /* ios */
+ sz = mul_size(AioProcs(),
+ mul_size(io_max_concurrency, sizeof(PgAioHandle)));
+
+ return sz;
+}
+
+static Size
+AioHandleIOVShmemSize(void)
+{
+ return mul_size(sizeof(struct iovec),
+ mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+ io_max_concurrency));
+}
+
+static Size
+AioHandleDataShmemSize(void)
+{
+ return mul_size(sizeof(uint64),
+ mul_size(mul_size(PG_IOV_MAX, AioProcs()),
+ io_max_concurrency));
+}
+
+/*
+ * Choose a suitable value for io_max_concurrency.
+ *
+ * It's unlikely that we could have more IOs in flight than buffers that we
+ * would be allowed to pin.
+ *
+ * On the upper end, apply a cap too - just because shared_buffers is large,
+ * it doesn't make sense to have millions of buffers undergo IO concurrently.
+ */
+static int
+AioChooseMaxConccurrency(void)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ /* Similar logic to LimitAdditionalPins() */
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ max_proportional_pins = Max(max_proportional_pins, 1);
+
+ /* apply upper limit */
+ return Min(max_proportional_pins, 64);
+}
+
Size
AioShmemSize(void)
{
Size sz = 0;
+ /*
+ * We prefer to report this value's source as PGC_S_DYNAMIC_DEFAULT.
+ * However, if the DBA explicitly set io_max_concurrency = -1 in the
+ * config file, then PGC_S_DYNAMIC_DEFAULT will fail to override that and
+ * we must force the matter with PGC_S_OVERRIDE.
+ */
+ if (io_max_concurrency == -1)
+ {
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "%d", AioChooseMaxConccurrency());
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_DYNAMIC_DEFAULT);
+ if (io_max_concurrency == -1) /* failed to apply it? */
+ SetConfigOption("io_max_concurrency", buf, PGC_POSTMASTER,
+ PGC_S_OVERRIDE);
+ }
+
+ sz = add_size(sz, AioCtlShmemSize());
+ sz = add_size(sz, AioBackendShmemSize());
+ sz = add_size(sz, AioHandleShmemSize());
+ sz = add_size(sz, AioHandleIOVShmemSize());
+ sz = add_size(sz, AioHandleDataShmemSize());
+
+ if (pgaio_method_ops->shmem_size)
+ sz = add_size(sz, pgaio_method_ops->shmem_size());
+
return sz;
}
void
AioShmemInit(void)
{
+ bool found;
+ uint32 io_handle_off = 0;
+ uint32 iovec_off = 0;
+ uint32 per_backend_iovecs = io_max_concurrency * PG_IOV_MAX;
+
+ pgaio_ctl = (PgAioCtl *)
+ ShmemInitStruct("AioCtl", AioCtlShmemSize(), &found);
+
+ if (found)
+ goto out;
+
+ memset(pgaio_ctl, 0, AioCtlShmemSize());
+
+ pgaio_ctl->io_handle_count = AioProcs() * io_max_concurrency;
+ pgaio_ctl->iovec_count = AioProcs() * per_backend_iovecs;
+
+ pgaio_ctl->backend_state = (PgAioBackend *)
+ ShmemInitStruct("AioBackend", AioBackendShmemSize(), &found);
+
+ pgaio_ctl->io_handles = (PgAioHandle *)
+ ShmemInitStruct("AioHandle", AioHandleShmemSize(), &found);
+
+ pgaio_ctl->iovecs = (struct iovec *)
+ ShmemInitStruct("AioHandleIOV", AioHandleIOVShmemSize(), &found);
+ pgaio_ctl->handle_data = (uint64 *)
+ ShmemInitStruct("AioHandleData", AioHandleDataShmemSize(), &found);
+
+ for (int procno = 0; procno < AioProcs(); procno++)
+ {
+ PgAioBackend *bs = &pgaio_ctl->backend_state[procno];
+
+ bs->io_handle_off = io_handle_off;
+ io_handle_off += io_max_concurrency;
+
+ dclist_init(&bs->idle_ios);
+ memset(bs->staged_ios, 0, sizeof(PgAioHandle *) * PGAIO_SUBMIT_BATCH_SIZE);
+ dclist_init(&bs->in_flight_ios);
+
+ /* initialize per-backend IOs */
+ for (int i = 0; i < io_max_concurrency; i++)
+ {
+ PgAioHandle *ioh = &pgaio_ctl->io_handles[bs->io_handle_off + i];
+
+ ioh->generation = 1;
+ ioh->owner_procno = procno;
+ ioh->iovec_off = iovec_off;
+ ioh->handle_data_len = 0;
+ ioh->report_return = NULL;
+ ioh->resowner = NULL;
+ ioh->num_shared_callbacks = 0;
+ ioh->distilled_result.status = ARS_UNKNOWN;
+ ioh->flags = 0;
+
+ ConditionVariableInit(&ioh->cv);
+
+ dclist_push_tail(&bs->idle_ios, &ioh->node);
+ iovec_off += PG_IOV_MAX;
+ }
+ }
+
+out:
+ /* Initialize IO method specific resources. */
+ if (pgaio_method_ops->shmem_init)
+ pgaio_method_ops->shmem_init(!found);
}
void
pgaio_init_backend(void)
{
+ /* shouldn't be initialized twice */
+ Assert(!pgaio_my_backend);
+
+ if (MyProc == NULL || MyProcNumber >= AioProcs())
+ elog(ERROR, "aio requires a normal PGPROC");
+
+ pgaio_my_backend = &pgaio_ctl->backend_state[MyProcNumber];
+
+ if (pgaio_method_ops->init_backend)
+ pgaio_method_ops->init_backend();
+
+ before_shmem_exit(pgaio_shutdown, 0);
}
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
new file mode 100644
index 00000000000..bb010d6152c
--- /dev/null
+++ b/src/backend/storage/aio/aio_io.c
@@ -0,0 +1,175 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_io.c
+ * AIO - Low Level IO Handling
+ *
+ * Functions related to associating IO operations to IO Handles and IO-method
+ * independent support functions for actually performing IO.
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_io.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "utils/wait_event.h"
+
+
+static void pgaio_io_before_prep(PgAioHandle *ioh);
+
+
+
+/* --------------------------------------------------------------------------------
+ * Public IO related functions operating on IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Scatter/gather IO needs to associate an iovec with the Handle. To support
+ * worker mode this data needs to be in shared memory.
+ *
+ * XXX: Right now the amount of space available for each IO is
+ * PG_IOV_MAX. While it's tempting to use the io_combine_limit GUC, that's
+ * PGC_USERSET, so we can't allocate shared memory based on that.
+ */
+int
+pgaio_io_get_iovec(PgAioHandle *ioh, struct iovec **iov)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+
+ *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+
+ return PG_IOV_MAX;
+}
+
+PgAioOpData *
+pgaio_io_get_op_data(PgAioHandle *ioh)
+{
+ return &ioh->op_data;
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * "Preparation" routines for individual IO operations
+ *
+ * These are called by the code actually initiating an IO, to associate the IO
+ * specific data with an AIO handle.
+ *
+ * Each of the preparation routines first needs to call
+ * pgaio_io_before_prep(), then fill the IO-specific fields in the handle,
+ * and finally call pgaio_io_stage().
+ * --------------------------------------------------------------------------------
+ */
+
+void
+pgaio_io_prep_readv(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.read.fd = fd;
+ ioh->op_data.read.offset = offset;
+ ioh->op_data.read.iov_length = iovcnt;
+
+ pgaio_io_stage(ioh, PGAIO_OP_READV);
+}
+
+void
+pgaio_io_prep_writev(PgAioHandle *ioh,
+ int fd, int iovcnt, uint64 offset)
+{
+ pgaio_io_before_prep(ioh);
+
+ ioh->op_data.write.fd = fd;
+ ioh->op_data.write.offset = offset;
+ ioh->op_data.write.iov_length = iovcnt;
+
+ pgaio_io_stage(ioh, PGAIO_OP_WRITEV);
+}
+
+
+
+/* --------------------------------------------------------------------------------
+ * Internal IO related functions operating on IO Handles
+ * --------------------------------------------------------------------------------
+ */
+
+/*
+ * Execute IO operation synchronously. This is implemented here, not in
+ * method_sync.c, because other IO methods might also use it / fall back to it.
+ */
+void
+pgaio_io_perform_synchronously(PgAioHandle *ioh)
+{
+ ssize_t result = 0;
+ struct iovec *iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+
+ /* Perform IO. */
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_READ);
+ result = pg_preadv(ioh->op_data.read.fd, iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_WRITEV:
+ pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_WRITE);
+ result = pg_pwritev(ioh->op_data.write.fd, iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ pgstat_report_wait_end();
+ break;
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to execute invalid IO operation");
+ }
+
+ ioh->result = result < 0 ? -errno : result;
+
+ pgaio_io_process_completion(ioh, ioh->result);
+}
+
+/*
+ * Helper function to be called by IO operation preparation functions, before
+ * any data in the handle is set. Mostly to centralize assertions.
+ */
+static void
+pgaio_io_before_prep(PgAioHandle *ioh)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(pgaio_io_has_target(ioh));
+ Assert(ioh->op == PGAIO_OP_INVALID);
+}
+
+/*
+ * Could be made part of the public interface, but it's not clear there's
+ * really a use case for that.
+ */
+const char *
+pgaio_io_get_op_name(PgAioHandle *ioh)
+{
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_INVALID:
+ return "invalid";
+ case PGAIO_OP_READV:
+ return "read";
+ case PGAIO_OP_WRITEV:
+ return "write";
+ }
+
+ return NULL; /* silence compiler */
+}
diff --git a/src/backend/storage/aio/aio_target.c b/src/backend/storage/aio/aio_target.c
new file mode 100644
index 00000000000..15428968e58
--- /dev/null
+++ b/src/backend/storage/aio/aio_target.c
@@ -0,0 +1,108 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_target.c
+ * AIO - Functionality related to executing IO for different targets
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_target.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+
+/*
+ * Registry for entities that can be the target of AIO.
+ *
+ * To support execution by worker processes, the file descriptor for an IO
+ * may need to be reopened in a different process. This is done via the
+ * PgAioTargetInfo.reopen callback.
+ */
+static const PgAioTargetInfo *pgaio_target_info[] = {
+ [PGAIO_TID_INVALID] = &(PgAioTargetInfo) {
+ .name = "invalid",
+ },
+};
+
+
+
+bool
+pgaio_io_has_target(PgAioHandle *ioh)
+{
+ return ioh->target != PGAIO_TID_INVALID;
+}
+
+/*
+ * Return the name for the target associated with the IO. Mostly useful for
+ * debugging/logging.
+ */
+const char *
+pgaio_io_get_target_name(PgAioHandle *ioh)
+{
+ Assert(ioh->target >= 0 && ioh->target < PGAIO_TID_COUNT);
+
+ return pgaio_target_info[ioh->target]->name;
+}
+
+/*
+ * Assign a target to the IO.
+ *
+ * This has to be called exactly once before pgaio_io_prep_*() is called.
+ */
+void
+pgaio_io_set_target(PgAioHandle *ioh, PgAioTargetID targetid)
+{
+ Assert(ioh->state == PGAIO_HS_HANDED_OUT);
+ Assert(ioh->target == PGAIO_TID_INVALID);
+
+ ioh->target = targetid;
+}
+
+PgAioTargetData *
+pgaio_io_get_target_data(PgAioHandle *ioh)
+{
+ return &ioh->target_data;
+}
+
+/*
+ * Return a stringified description of the IO's target.
+ *
+ * The string is localized and allocated in the current memory context.
+ */
+char *
+pgaio_io_get_target_description(PgAioHandle *ioh)
+{
+ return pgaio_target_info[ioh->target]->describe_identity(&ioh->target_data);
+}
+
+/*
+ * Internal: Check if pgaio_io_reopen() is available for the IO.
+ */
+bool
+pgaio_io_can_reopen(PgAioHandle *ioh)
+{
+ return pgaio_target_info[ioh->target]->reopen != NULL;
+}
+
+/*
+ * Internal: Before executing an IO outside of the context of the process the
+ * IO has been prepared in, the file descriptor has to be reopened - any FD
+ * referenced in the IO itself won't be valid in the separate process.
+ */
+void
+pgaio_io_reopen(PgAioHandle *ioh)
+{
+ Assert(ioh->target >= 0 && ioh->target < PGAIO_TID_COUNT);
+ Assert(ioh->op >= 0 && ioh->op < PGAIO_OP_COUNT);
+
+ pgaio_target_info[ioh->target]->reopen(ioh);
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index c822fd4ddf7..2c26089d52e 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -2,6 +2,10 @@
backend_sources += files(
'aio.c',
+ 'aio_callback.c',
'aio_init.c',
+ 'aio_io.c',
+ 'aio_target.c',
+ 'method_sync.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_sync.c b/src/backend/storage/aio/method_sync.c
new file mode 100644
index 00000000000..43f9c8bd0b3
--- /dev/null
+++ b/src/backend/storage/aio/method_sync.c
@@ -0,0 +1,47 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_sync.c
+ * AIO - perform "AIO" by executing it synchronously
+ *
+ * This method mainly exists to check whether AIO use causes regressions. Other IO
+ * methods might also fall back to the synchronous method for functionality
+ * they cannot provide.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+
+static bool pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_sync_ops = {
+ .needs_synchronous_execution = pgaio_sync_needs_synchronous_execution,
+ .submit = pgaio_sync_submit,
+};
+
+
+
+static bool
+pgaio_sync_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return true;
+}
+
+static int
+pgaio_sync_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ elog(ERROR, "should be unreachable");
+
+ return 0;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..b5d3dcbf1e9 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -191,6 +191,9 @@ ABI_compatibility:
Section: ClassName - WaitEventIO
+AIO_SUBMIT "Waiting for AIO submission."
+AIO_DRAIN "Waiting for IOs to finish."
+AIO_COMPLETION "Waiting for completion callback."
BASEBACKUP_READ "Waiting for base backup to read from a file."
BASEBACKUP_SYNC "Waiting for data written by a base backup to reach durable storage."
BASEBACKUP_WRITE "Waiting for base backup to write to a file."
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index ac5ca4a765e..e5d852b5ee6 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -47,6 +47,8 @@
#include "common/hashfn.h"
#include "common/int.h"
+#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -155,6 +157,12 @@ struct ResourceOwnerData
/* The local locks cache. */
LOCALLOCK *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+
+ /*
+ * AIO handles need to be registered in critical sections and therefore
+ * cannot use the normal ResourceElem mechanism.
+ */
+ dlist_head aio_handles;
};
@@ -425,6 +433,8 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
parent->firstchild = owner;
}
+ dlist_init(&owner->aio_handles);
+
return owner;
}
@@ -725,6 +735,14 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
* so issue warnings. In the abort case, just clean up quietly.
*/
ResourceOwnerReleaseAll(owner, phase, isCommit);
+
+ /* XXX: Could probably be a later phase? */
+ while (!dlist_is_empty(&owner->aio_handles))
+ {
+ dlist_node *node = dlist_head_node(&owner->aio_handles);
+
+ pgaio_io_release_resowner(node, !isCommit);
+ }
}
else if (phase == RESOURCE_RELEASE_LOCKS)
{
@@ -1082,3 +1100,15 @@ ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock)
elog(ERROR, "lock reference %p is not owned by resource owner %s",
locallock, owner->name);
}
+
+void
+ResourceOwnerRememberAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_push_tail(&owner->aio_handles, ioh_node);
+}
+
+void
+ResourceOwnerForgetAioHandle(ResourceOwner owner, struct dlist_node *ioh_node)
+{
+ dlist_delete_from(&owner->aio_handles, ioh_node);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3bec090428d..c7f34559b1b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1267,6 +1267,7 @@ InvalMessageArray
InvalidationInfo
InvalidationMsgsGroup
IoMethod
+IoMethodOps
IpcMemoryId
IpcMemoryKey
IpcMemoryState
@@ -2105,6 +2106,26 @@ Permutation
PermutationStep
PermutationStepBlocker
PermutationStepBlockerType
+PgAioBackend
+PgAioCtl
+PgAioHandle
+PgAioHandleCallbackID
+PgAioHandleCallbackStage
+PgAioHandleCallbackComplete
+PgAioHandleCallbackReport
+PgAioHandleCallbacks
+PgAioHandleCallbacksEntry
+PgAioHandleFlags
+PgAioHandleState
+PgAioOp
+PgAioOpData
+PgAioResult
+PgAioResultStatus
+PgAioReturn
+PgAioTargetData
+PgAioTargetID
+PgAioTargetInfo
+PgAioWaitRef
PgArchData
PgBackendGSSStatus
PgBackendSSLStatus
--
2.48.1.76.g4e746b1a31.dirty
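As an aside, the synchronous fallback in the patch above (pgaio_io_perform_synchronously) boils down to a single preadv()/pwritev() over the handle's iovec, with failures reported as a negative errno. A minimal stand-alone sketch of that primitive, using a hypothetical scatter_read() helper that is not part of the patch:

```c
/*
 * Sketch of the scatter/gather primitive used for PGAIO_OP_READV: one
 * preadv() call filling multiple buffers, returning -errno on failure
 * (the patch's result convention). scatter_read() is hypothetical,
 * local to this example.
 */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t
scatter_read(int fd, void *a, size_t alen, void *b, size_t blen, off_t off)
{
	struct iovec iov[2];

	iov[0].iov_base = a;
	iov[0].iov_len = alen;
	iov[1].iov_base = b;
	iov[1].iov_len = blen;

	/* a failed read is reported as a negative errno, like in the patch */
	ssize_t		n = preadv(fd, iov, 2, off);

	return n < 0 ? -errno : n;
}
```

This is also why the iovec has to live in shared memory for worker mode: whichever process ends up issuing the preadv() needs to see the same iovec array.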
v2.3-0012-aio-Skeleton-IO-worker-infrastructure.patch
From 5e84720afa46fdfd892a8bac36585f0f7a29d3f3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:43:40 -0500
Subject: [PATCH v2.3 12/30] aio: Skeleton IO worker infrastructure
This doesn't do anything useful on its own, but the code that needs to be
touched is independent of other changes.
Remarks:
- should completely get rid of ID assignment logic in postmaster.c
- postmaster.c badly needs a refactoring.
- dynamic increase / decrease of workers based on IO load
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/miscadmin.h | 2 +
src/include/postmaster/postmaster.h | 1 +
src/include/storage/aio_init.h | 2 +
src/include/storage/io_worker.h | 22 +++
src/include/storage/proc.h | 4 +-
src/backend/postmaster/launch_backend.c | 2 +
src/backend/postmaster/pmchild.c | 1 +
src/backend/postmaster/postmaster.c | 169 ++++++++++++++++--
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_init.c | 7 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_worker.c | 86 +++++++++
src/backend/tcop/postgres.c | 2 +
src/backend/utils/activity/pgstat_backend.c | 1 +
src/backend/utils/activity/pgstat_io.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
19 files changed, 305 insertions(+), 15 deletions(-)
create mode 100644 src/include/storage/io_worker.h
create mode 100644 src/backend/storage/aio/method_worker.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d016a9c9248..c2b3e27c613 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -360,6 +360,7 @@ typedef enum BackendType
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER,
@@ -389,6 +390,7 @@ extern PGDLLIMPORT BackendType MyBackendType;
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
+#define AmIoWorkerProcess() (MyBackendType == B_IO_WORKER)
#define AmSpecialWorkerProcess() \
(AmAutoVacuumLauncherProcess() || \
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 188a06e2379..253dc98c50e 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -98,6 +98,7 @@ extern void InitProcessGlobals(void);
extern int MaxLivePostmasterChildren(void);
extern bool PostmasterMarkPIDForWorkerNotify(int);
+extern void assign_io_workers(int newval, void *extra);
#ifdef WIN32
extern void pgwin32_register_deadchild_callback(HANDLE procHandle, DWORD procId);
diff --git a/src/include/storage/aio_init.h b/src/include/storage/aio_init.h
index 44151ef55bf..bc15b720fca 100644
--- a/src/include/storage/aio_init.h
+++ b/src/include/storage/aio_init.h
@@ -21,4 +21,6 @@ extern void AioShmemInit(void);
extern void pgaio_init_backend(void);
+extern bool pgaio_workers_enabled(void);
+
#endif /* AIO_INIT_H */
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
new file mode 100644
index 00000000000..223d614dc4a
--- /dev/null
+++ b/src/include/storage/io_worker.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * io_worker.h
+ * IO worker for implementing AIO "ourselves"
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/io_worker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef IO_WORKER_H
+#define IO_WORKER_H
+
+
+extern void IoWorkerMain(char *startup_data, size_t startup_data_len) pg_attribute_noreturn();
+
+extern int io_workers;
+
+#endif /* IO_WORKER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..64e9b8ff8c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -448,7 +448,9 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 6
+#define MAX_IO_WORKERS 32
+#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
+
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index a97a1eda6da..54b4c22bd63 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -48,6 +48,7 @@
#include "replication/slotsync.h"
#include "replication/walreceiver.h"
#include "storage/dsm.h"
+#include "storage/io_worker.h"
#include "storage/pg_shmem.h"
#include "tcop/backend_startup.h"
#include "utils/memutils.h"
@@ -197,6 +198,7 @@ static child_process_kind child_process_kinds[] = {
[B_ARCHIVER] = {"archiver", PgArchiverMain, true},
[B_BG_WRITER] = {"bgwriter", BackgroundWriterMain, true},
[B_CHECKPOINTER] = {"checkpointer", CheckpointerMain, true},
+ [B_IO_WORKER] = {"io_worker", IoWorkerMain, true},
[B_STARTUP] = {"startup", StartupProcessMain, true},
[B_WAL_RECEIVER] = {"wal_receiver", WalReceiverMain, true},
[B_WAL_SUMMARIZER] = {"wal_summarizer", WalSummarizerMain, true},
diff --git a/src/backend/postmaster/pmchild.c b/src/backend/postmaster/pmchild.c
index 0d473226c3a..cde1d23a4ca 100644
--- a/src/backend/postmaster/pmchild.c
+++ b/src/backend/postmaster/pmchild.c
@@ -101,6 +101,7 @@ InitPostmasterChildSlots(void)
pmchild_pools[B_AUTOVAC_WORKER].size = autovacuum_worker_slots;
pmchild_pools[B_BG_WORKER].size = max_worker_processes;
+ pmchild_pools[B_IO_WORKER].size = MAX_IO_WORKERS;
/*
* There can be only one of each of these running at a time. They each
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 115ad3d31d2..ddd82b94720 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -108,9 +108,12 @@
#include "replication/logicallauncher.h"
#include "replication/slotsync.h"
#include "replication/walsender.h"
+#include "storage/aio_init.h"
#include "storage/fd.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "tcop/backend_startup.h"
#include "tcop/tcopprot.h"
#include "utils/datetime.h"
@@ -334,6 +337,7 @@ typedef enum
* ckpt */
PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
* finish */
+ PM_WAIT_IO_WORKERS, /* waiting for io workers to exit */
PM_WAIT_CHECKPOINTER, /* waiting for checkpointer to shut down */
PM_WAIT_DEAD_END, /* waiting for dead-end children to exit */
PM_NO_CHILDREN, /* all important children have exited */
@@ -396,6 +400,10 @@ bool LoadedSSL = false;
static DNSServiceRef bonjour_sdref = NULL;
#endif
+/* State for IO worker management. */
+static int io_worker_count = 0;
+static PMChild *io_worker_children[MAX_IO_WORKERS];
+
/*
* postmaster.c - function prototypes
*/
@@ -430,6 +438,8 @@ static void TerminateChildren(int signal);
static int CountChildren(BackendTypeMask targetMask);
static void LaunchMissingBackgroundProcesses(void);
static void maybe_start_bgworkers(void);
+static bool maybe_reap_io_worker(int pid);
+static void maybe_adjust_io_workers(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static PMChild *StartChildProcess(BackendType type);
static void StartSysLogger(void);
@@ -1357,6 +1367,11 @@ PostmasterMain(int argc, char *argv[])
*/
AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
+ UpdatePMState(PM_STARTUP);
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
/* Start bgwriter and checkpointer so they can help with recovery */
if (CheckpointerPMChild == NULL)
CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
@@ -1369,7 +1384,6 @@ PostmasterMain(int argc, char *argv[])
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- UpdatePMState(PM_STARTUP);
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2493,6 +2507,16 @@ process_pm_child_exit(void)
continue;
}
+ /* Was it an IO worker? */
+ if (maybe_reap_io_worker(pid))
+ {
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus, _("io worker"));
+
+ maybe_adjust_io_workers();
+ continue;
+ }
+
/*
* Was it a backend or a background worker?
*/
@@ -2704,6 +2728,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
case PM_WAIT_XLOG_SHUTDOWN:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_CHECKPOINTER:
+ case PM_WAIT_IO_WORKERS:
/*
* Note that we switch *back* to PM_WAIT_BACKENDS here. This way
@@ -2892,20 +2917,21 @@ PostmasterStateMachine(void)
/*
* If we are doing crash recovery or an immediate shutdown then we
- * expect archiver, checkpointer and walsender to exit as well,
- * otherwise not.
+ * expect archiver, checkpointer, io workers and walsender to exit as
+ * well, otherwise not.
*/
if (FatalError || Shutdown >= ImmediateShutdown)
targetMask = btmask_add(targetMask,
B_CHECKPOINTER,
B_ARCHIVER,
+ B_IO_WORKER,
B_WAL_SENDER);
/*
- * Normally walsenders and archiver will continue running; they will
- * be terminated later after writing the checkpoint record. We also
- * let dead-end children to keep running for now. The syslogger
- * process exits last.
+ * Normally archiver, checkpointer, IO workers and walsenders will
+ * continue running; they will be terminated later after writing the
+ * checkpoint record. We also let dead-end children to keep running
+ * for now. The syslogger process exits last.
*
* This assertion checks that we have covered all backend types,
* either by including them in targetMask, or by noting here that they
@@ -2920,12 +2946,13 @@ PostmasterStateMachine(void)
B_LOGGER);
/*
- * Archiver, checkpointer and walsender may or may not be in
- * targetMask already.
+ * Archiver, checkpointer, IO workers, and walsender may or may
+ * not be in targetMask already.
*/
remainMask = btmask_add(remainMask,
B_ARCHIVER,
B_CHECKPOINTER,
+ B_IO_WORKER,
B_WAL_SENDER);
/* these are not real postmaster children */
@@ -3020,11 +3047,25 @@ PostmasterStateMachine(void)
{
/*
* PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
- * than checkpointer and dead-end children left. There shouldn't be
- * any regular backends left by now anyway; what we're really waiting
- * for is for walsenders and archiver to exit.
+ * than checkpointer, io workers and dead-end children left. There
+ * shouldn't be any regular backends left by now anyway; what we're
+ * really waiting for is for walsenders and archiver to exit.
*/
- if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_IO_WORKER,
+ B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+ {
+ UpdatePMState(PM_WAIT_IO_WORKERS);
+ SignalChildren(SIGUSR2, btmask(B_IO_WORKER));
+ }
+ }
+
+ if (pmState == PM_WAIT_IO_WORKERS)
+ {
+ /*
+ * PM_WAIT_IO_WORKERS state ends when only the checkpointer and
+ * dead-end children are left.
+ */
+ if (io_worker_count == 0)
{
UpdatePMState(PM_WAIT_CHECKPOINTER);
@@ -3151,10 +3192,14 @@ PostmasterStateMachine(void)
/* re-create shared memory and semaphores */
CreateSharedMemoryAndSemaphores();
+ UpdatePMState(PM_STARTUP);
+
+ /* Make sure we can perform I/O while starting up. */
+ maybe_adjust_io_workers();
+
StartupPMChild = StartChildProcess(B_STARTUP);
Assert(StartupPMChild != NULL);
StartupStatus = STARTUP_RUNNING;
- UpdatePMState(PM_STARTUP);
/* crash recovery started, reset SIGKILL flag */
AbortStartTime = 0;
@@ -3178,6 +3223,7 @@ pmstate_name(PMState state)
PM_TOSTR_CASE(PM_WAIT_BACKENDS);
PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
+ PM_TOSTR_CASE(PM_WAIT_IO_WORKERS);
PM_TOSTR_CASE(PM_WAIT_DEAD_END);
PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
PM_TOSTR_CASE(PM_NO_CHILDREN);
@@ -4093,6 +4139,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
case PM_WAIT_DEAD_END:
case PM_WAIT_XLOG_ARCHIVAL:
case PM_WAIT_XLOG_SHUTDOWN:
+ case PM_WAIT_IO_WORKERS:
case PM_WAIT_BACKENDS:
case PM_STOP_BACKENDS:
break;
@@ -4243,6 +4290,100 @@ maybe_start_bgworkers(void)
}
}
+static bool
+maybe_reap_io_worker(int pid)
+{
+ for (int id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] &&
+ io_worker_children[id]->pid == pid)
+ {
+ ReleasePostmasterChildSlot(io_worker_children[id]);
+
+ --io_worker_count;
+ io_worker_children[id] = NULL;
+ return true;
+ }
+ }
+ return false;
+}
+
+static void
+maybe_adjust_io_workers(void)
+{
+ if (!pgaio_workers_enabled())
+ return;
+
+ /*
+ * If we're in the final stage of shutdown, then we're just waiting for all
+ * processes to exit.
+ */
+ if (pmState >= PM_WAIT_IO_WORKERS)
+ return;
+
+ /* Don't start new workers during an immediate shutdown either. */
+ if (Shutdown >= ImmediateShutdown)
+ return;
+
+ /*
+ * Don't start new workers if we're in the shutdown phase of a crash
+ * restart. But we *do* need to start if we're already starting up again.
+ */
+ if (FatalError && pmState >= PM_STOP_BACKENDS)
+ return;
+
+ Assert(pmState < PM_WAIT_IO_WORKERS);
+
+ /* Not enough running? */
+ while (io_worker_count < io_workers)
+ {
+ PMChild *child;
+ int id;
+
+ /* find unused entry in io_worker_children array */
+ for (id = 0; id < MAX_IO_WORKERS; ++id)
+ {
+ if (io_worker_children[id] == NULL)
+ break;
+ }
+ if (id == MAX_IO_WORKERS)
+ elog(ERROR, "could not find a free IO worker ID");
+
+ /* Try to launch one. */
+ child = StartChildProcess(B_IO_WORKER);
+ if (child != NULL)
+ {
+ io_worker_children[id] = child;
+ ++io_worker_count;
+ }
+ else
+ break; /* XXX try again soon? */
+ }
+
+ /* Too many running? */
+ if (io_worker_count > io_workers)
+ {
+ /* ask the IO worker in the highest slot to exit */
+ for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+ {
+ if (io_worker_children[id] != NULL)
+ {
+ kill(io_worker_children[id]->pid, SIGUSR2);
+ break;
+ }
+ }
+ }
+}
+
+void
+assign_io_workers(int newval, void *extra)
+{
+ io_workers = newval;
+ if (!IsUnderPostmaster && pmState > PM_INIT)
+ maybe_adjust_io_workers();
+}
+
+
/*
* When a backend asks to be notified about worker state changes, we
* set a flag in its backend entry. The background worker machinery needs
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index 89f821ea7e1..f51c34a37f8 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -15,6 +15,7 @@ OBJS = \
aio_io.o \
aio_target.o \
method_sync.o \
+ method_worker.o \
read_stream.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 0e98cc0c8fb..233c144965b 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -221,3 +221,10 @@ pgaio_init_backend(void)
before_shmem_exit(pgaio_shutdown, 0);
}
+
+bool
+pgaio_workers_enabled(void)
+{
+ /* placeholder for future commit */
+ return false;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 2c26089d52e..74f94c6e40b 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -7,5 +7,6 @@ backend_sources += files(
'aio_io.c',
'aio_target.c',
'method_sync.c',
+ 'method_worker.c',
'read_stream.c',
)
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
new file mode 100644
index 00000000000..1d79e7e85ef
--- /dev/null
+++ b/src/backend/storage/aio/method_worker.c
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_worker.c
+ * AIO implementation using workers
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_worker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/auxprocess.h"
+#include "postmaster/interrupt.h"
+#include "storage/io_worker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/wait_event.h"
+
+
+int io_workers = 3;
+
+
+void
+IoWorkerMain(char *startup_data, size_t startup_data_len)
+{
+ sigjmp_buf local_sigjmp_buf;
+
+ MyBackendType = B_IO_WORKER;
+ AuxiliaryProcessMainCommon();
+
+ /* TODO review all signals */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, die); /* to allow manually triggering worker restart */
+
+ /*
+ * Ignore SIGTERM, will get explicit shutdown via SIGUSR2 later in the
+ * shutdown sequence, similar to checkpointer.
+ */
+ pqsignal(SIGTERM, SIG_IGN);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /* see PostgresMain() */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ error_context_stack = NULL;
+ HOLD_INTERRUPTS();
+
+ /*
+ * We normally shouldn't get errors here. Need to do just enough error
+ * recovery so that we can mark the IO as failed and then exit.
+ */
+ LWLockReleaseAll();
+
+ /* TODO: recover from IO errors */
+
+ EmitErrorReport();
+ proc_exit(1);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ while (!ShutdownRequestPending)
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ proc_exit(0);
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5655348a2e2..605c8950043 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3313,6 +3313,8 @@ ProcessInterrupts(void)
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating background worker \"%s\" due to administrator command",
MyBgworkerEntry->bgw_type)));
+ else if (AmIoWorkerProcess())
+ proc_exit(0);
else
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..b2151ab4ca3 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -241,6 +241,7 @@ pgstat_tracks_backend_bktype(BackendType bktype)
case B_WAL_SUMMARIZER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_STARTUP:
return false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..70518749142 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -365,6 +365,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_BG_WORKER:
case B_BG_WRITER:
case B_CHECKPOINTER:
+ case B_IO_WORKER:
case B_SLOTSYNC_WORKER:
case B_STANDALONE_BACKEND:
case B_STARTUP:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index b5d3dcbf1e9..e702aa7152a 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -57,6 +57,7 @@ BGWRITER_HIBERNATE "Waiting in background writer process, hibernating."
BGWRITER_MAIN "Waiting in main loop of background writer process."
CHECKPOINTER_MAIN "Waiting in main loop of checkpointer process."
CHECKPOINTER_SHUTDOWN "Waiting for checkpointer process to be terminated."
+IO_WORKER_MAIN "Waiting in main loop of IO worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0347fc11092..cbca090d2b0 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -293,6 +293,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_CHECKPOINTER:
backendDesc = gettext_noop("checkpointer");
break;
+ case B_IO_WORKER:
+ backendDesc = "io worker";
+ break;
case B_LOGGER:
backendDesc = gettext_noop("logger");
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index de524eccad5..8a83dcc820d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -74,6 +74,7 @@
#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/io_worker.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -3233,6 +3234,18 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"io_workers",
+ PGC_SIGHUP,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Number of IO worker processes, for io_method=worker."),
+ NULL,
+ },
+ &io_workers,
+ 3, 1, MAX_IO_WORKERS,
+ NULL, assign_io_workers, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index fba0ad4b624..e68e112c72f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -848,6 +848,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#------------------------------------------------------------------------------
#io_method = sync # (change requires restart)
+#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
# flight at the same time in one backend
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0013-aio-Add-worker-method.patch (text/x-diff; charset=us-ascii)
From 5bdabe467f82dc7cc7348d8698b0c10f7bbeb7b8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:34 -0500
Subject: [PATCH v2.3 13/30] aio: Add worker method
---
src/include/storage/aio.h | 5 +-
src/include/storage/aio_internal.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/backend/storage/aio/aio.c | 2 +
src/backend/storage/aio/aio_init.c | 12 +-
src/backend/storage/aio/method_worker.c | 394 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/tools/pgindent/typedefs.list | 3 +
9 files changed, 410 insertions(+), 11 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index ffd382593d0..39d7e4cff55 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -23,10 +23,11 @@
typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
+ IOMETHOD_WORKER,
} IoMethod;
-/* We'll default to synchronous execution. */
-#define DEFAULT_IO_METHOD IOMETHOD_SYNC
+/* We'll default to worker based execution. */
+#define DEFAULT_IO_METHOD IOMETHOD_WORKER
/*
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 174d365f9c0..86d8d099c91 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -285,6 +285,7 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
+extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..932024b1b0b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, AioWorkerSubmissionQueue)
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index cefa888884c..6c264b61ca5 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -57,6 +57,7 @@ static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
/* Options for io_method. */
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
+ {"worker", IOMETHOD_WORKER, false},
{NULL, 0, false}
};
@@ -73,6 +74,7 @@ PgAioBackend *pgaio_my_backend;
static const IoMethodOps *const pgaio_method_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
+ [IOMETHOD_WORKER] = &pgaio_worker_ops,
};
/* callbacks for the configured io_method, set by assign_io_method */
diff --git a/src/backend/storage/aio/aio_init.c b/src/backend/storage/aio/aio_init.c
index 233c144965b..76fcdf64670 100644
--- a/src/backend/storage/aio/aio_init.c
+++ b/src/backend/storage/aio/aio_init.c
@@ -18,6 +18,7 @@
#include "storage/aio.h"
#include "storage/aio_init.h"
#include "storage/aio_internal.h"
+#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -39,6 +40,11 @@ AioCtlShmemSize(void)
static uint32
AioProcs(void)
{
+ /*
+ * While AIO workers don't need their own AIO context, we can't currently
+ * guarantee nothing gets assigned to a ProcNumber for an IO worker if
+ * we just subtracted MAX_IO_WORKERS.
+ */
return MaxBackends + NUM_AUXILIARY_PROCS;
}
@@ -211,6 +217,9 @@ pgaio_init_backend(void)
/* shouldn't be initialized twice */
Assert(!pgaio_my_backend);
+ if (MyBackendType == B_IO_WORKER)
+ return;
+
if (MyProc == NULL || MyProcNumber >= AioProcs())
elog(ERROR, "aio requires a normal PGPROC");
@@ -225,6 +234,5 @@ pgaio_init_backend(void)
bool
pgaio_workers_enabled(void)
{
- /* placeholder for future commit */
- return false;
+ return io_method == IOMETHOD_WORKER;
}
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 1d79e7e85ef..92415467c71 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -1,7 +1,22 @@
/*-------------------------------------------------------------------------
*
* method_worker.c
- * AIO implementation using workers
+ * AIO - perform AIO using worker processes
+ *
+ * Worker processes consume IOs from a shared memory submission queue, run
+ * traditional synchronous system calls, and perform the shared completion
+ * handling immediately. Client code submits most requests by pushing IOs
+ * into the submission queue, and waits (if necessary) using condition
+ * variables. Some IOs cannot be performed in another process due to lack of
+ * infrastructure for reopening the file, and must be processed synchronously by
+ * the client code when submitted.
+ *
+ * So that the submitter can make just one system call when submitting a batch
+ * of IOs, wakeups "fan out"; each woken backend can wake two more. XXX This
+ * could be improved by using futexes instead of latches to wake N waiters.
+ *
+ * This method of AIO is available in all builds on all operating systems, and
+ * is the default.
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -16,23 +31,323 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
+#include "port/pg_bitutils.h"
#include "postmaster/auxprocess.h"
#include "postmaster/interrupt.h"
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
#include "storage/io_worker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "tcop/tcopprot.h"
+#include "utils/ps_status.h"
#include "utils/wait_event.h"
+/* How many workers should each worker wake up if needed? */
+#define IO_WORKER_WAKEUP_FANOUT 2
+
+
+typedef struct AioWorkerSubmissionQueue
+{
+ uint32 size;
+ uint32 mask;
+ uint32 head;
+ uint32 tail;
+ uint32 ios[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerSubmissionQueue;
+
+typedef struct AioWorkerSlot
+{
+ Latch *latch;
+ bool in_use;
+} AioWorkerSlot;
+
+typedef struct AioWorkerControl
+{
+ uint64 idle_worker_mask;
+ AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} AioWorkerControl;
+
+
+static size_t pgaio_worker_shmem_size(void);
+static void pgaio_worker_shmem_init(bool first_time);
+
+static bool pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh);
+static int pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+
+
+const IoMethodOps pgaio_worker_ops = {
+ .shmem_size = pgaio_worker_shmem_size,
+ .shmem_init = pgaio_worker_shmem_init,
+
+ .needs_synchronous_execution = pgaio_worker_needs_synchronous_execution,
+ .submit = pgaio_worker_submit,
+};
+
+
int io_workers = 3;
+static int io_worker_queue_size = 64;
+static int MyIoWorkerId;
+
+
+static AioWorkerSubmissionQueue *io_worker_submission_queue;
+static AioWorkerControl *io_worker_control;
+
+
+static size_t
+pgaio_worker_shmem_size(void)
+{
+ return
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * io_worker_queue_size +
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers;
+}
+
+static void
+pgaio_worker_shmem_init(bool first_time)
+{
+ bool found;
+ int size;
+
+ /* Round size up to next power of two so we can make a mask. */
+ size = pg_nextpower2_32(io_worker_queue_size);
+
+ io_worker_submission_queue =
+ ShmemInitStruct("AioWorkerSubmissionQueue",
+ offsetof(AioWorkerSubmissionQueue, ios) +
+ sizeof(uint32) * size,
+ &found);
+ if (!found)
+ {
+ io_worker_submission_queue->size = size;
+ io_worker_submission_queue->head = 0;
+ io_worker_submission_queue->tail = 0;
+ }
+
+ io_worker_control =
+ ShmemInitStruct("AioWorkerControl",
+ offsetof(AioWorkerControl, workers) +
+ sizeof(AioWorkerSlot) * io_workers,
+ &found);
+ if (!found)
+ {
+ io_worker_control->idle_worker_mask = 0;
+ for (int i = 0; i < io_workers; ++i)
+ {
+ io_worker_control->workers[i].latch = NULL;
+ io_worker_control->workers[i].in_use = false;
+ }
+ }
+}
+
+
+static int
+pgaio_choose_idle_worker(void)
+{
+ int worker;
+
+ if (io_worker_control->idle_worker_mask == 0)
+ return -1;
+
+ /* Find the lowest bit position, and clear it. */
+ worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+
+ return worker;
+}
+
+static bool
+pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 new_head;
+
+ queue = io_worker_submission_queue;
+ new_head = (queue->head + 1) & (queue->size - 1);
+ if (new_head == queue->tail)
+ {
+ pgaio_debug(DEBUG1, "io queue is full, at %u elements",
+ io_worker_submission_queue->size);
+ return false; /* full */
+ }
+
+ queue->ios[queue->head] = pgaio_io_get_id(ioh);
+ queue->head = new_head;
+
+ return true;
+}
+
+static uint32
+pgaio_worker_submission_queue_consume(void)
+{
+ AioWorkerSubmissionQueue *queue;
+ uint32 result;
+
+ queue = io_worker_submission_queue;
+ if (queue->tail == queue->head)
+ return UINT32_MAX; /* empty */
+
+ result = queue->ios[queue->tail];
+ queue->tail = (queue->tail + 1) & (queue->size - 1);
+
+ return result;
+}
+
+static uint32
+pgaio_worker_submission_queue_depth(void)
+{
+ uint32 head;
+ uint32 tail;
+
+ head = io_worker_submission_queue->head;
+ tail = io_worker_submission_queue->tail;
+
+ if (tail > head)
+ head += io_worker_submission_queue->size;
+
+ Assert(head >= tail);
+
+ return head - tail;
+}
+
+static void
+pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+{
+ PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+ int nsync = 0;
+ Latch *wakeup = NULL;
+ int worker;
+
+ Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ for (int i = 0; i < nios; ++i)
+ {
+ Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
+ if (!pgaio_worker_submission_queue_insert(ios[i]))
+ {
+ /*
+ * We'll do it synchronously, but only after we've sent as many as
+ * we can to workers, to maximize concurrency.
+ */
+ synchronous_ios[nsync++] = ios[i];
+ continue;
+ }
+
+ if (wakeup == NULL)
+ {
+ /* Choose an idle worker to wake up if we haven't already. */
+ worker = pgaio_choose_idle_worker();
+ if (worker >= 0)
+ wakeup = io_worker_control->workers[worker].latch;
+
+ pgaio_debug_io(DEBUG4, ios[i],
+ "choosing worker %d",
+ worker);
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ if (wakeup)
+ SetLatch(wakeup);
+
+ /* Run whatever is left synchronously. */
+ if (nsync > 0)
+ {
+ for (int i = 0; i < nsync; ++i)
+ {
+ pgaio_io_perform_synchronously(synchronous_ios[i]);
+ }
+ }
+}
+
+static bool
+pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
+{
+ return
+ !IsUnderPostmaster
+ || ioh->flags & PGAIO_HF_REFERENCES_LOCAL
+ || !pgaio_io_can_reopen(ioh);
+}
+
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+
+ pgaio_io_prepare_submit(ioh);
+ }
+
+ pgaio_worker_submit_internal(num_staged_ios, staged_ios);
+
+ return num_staged_ios;
+}
+
+/*
+ * shmem_exit() callback that releases the worker's slot in io_worker_control.
+ */
+static void
+pgaio_worker_die(int code, Datum arg)
+{
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ Assert(io_worker_control->workers[MyIoWorkerId].in_use);
+ Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+
+ io_worker_control->workers[MyIoWorkerId].in_use = false;
+ io_worker_control->workers[MyIoWorkerId].latch = NULL;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+}
+
+/*
+ * Register the worker in shared memory, assign MyIoWorkerId and register a
+ * shutdown callback to release registration.
+ */
+static void
+pgaio_worker_register(void)
+{
+ MyIoWorkerId = -1;
+
+ /*
+ * XXX: This could do with more fine-grained locking. But it's also not
+ * very common for the number of workers to change at the moment...
+ */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+
+ for (int i = 0; i < io_workers; ++i)
+ {
+ if (!io_worker_control->workers[i].in_use)
+ {
+ Assert(io_worker_control->workers[i].latch == NULL);
+ io_worker_control->workers[i].in_use = true;
+ MyIoWorkerId = i;
+ break;
+ }
+ else
+ Assert(io_worker_control->workers[i].latch != NULL);
+ }
+
+ if (MyIoWorkerId == -1)
+ elog(ERROR, "couldn't find a free worker slot");
+
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ on_shmem_exit(pgaio_worker_die, 0);
+}
void
IoWorkerMain(char *startup_data, size_t startup_data_len)
{
sigjmp_buf local_sigjmp_buf;
+ volatile PgAioHandle *ioh = NULL;
+ char cmd[128];
MyBackendType = B_IO_WORKER;
AuxiliaryProcessMainCommon();
@@ -53,6 +368,11 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ pgaio_worker_register();
+
+ sprintf(cmd, "io worker: %d", MyIoWorkerId);
+ set_ps_display(cmd);
+
/* see PostgresMain() */
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
@@ -65,9 +385,18 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
*/
LWLockReleaseAll();
- /* TODO: recover from IO errors */
+ /* FIXME: recover from IO errors */
+ if (ioh != NULL)
+ {
+#if 0
+ /* EINTR is treated as a retryable error */
+ pgaio_process_io_completion(unvolatize(PgAioInProgress *, io),
+ EINTR);
+#endif
+ }
EmitErrorReport();
+
proc_exit(1);
}
@@ -76,10 +405,63 @@ IoWorkerMain(char *startup_data, size_t startup_data_len)
while (!ShutdownRequestPending)
{
- WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
- WAIT_EVENT_IO_WORKER_MAIN);
- ResetLatch(MyLatch);
- CHECK_FOR_INTERRUPTS();
+ uint32 io_index;
+ Latch *latches[IO_WORKER_WAKEUP_FANOUT];
+ int nlatches = 0;
+ int nwakeups = 0;
+ int worker;
+
+ /* Try to get a job to do. */
+ LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+ if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+ {
+ /* Nothing to do. Mark self idle. */
+ /*
+ * XXX: Invent some kind of back pressure to reduce useless
+ * wakeups?
+ */
+ io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+ }
+ else
+ {
+ /* Got one. Clear idle flag. */
+ io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+
+ /* See if we can wake up some peers. */
+ nwakeups = Min(pgaio_worker_submission_queue_depth(),
+ IO_WORKER_WAKEUP_FANOUT);
+ for (int i = 0; i < nwakeups; ++i)
+ {
+ if ((worker = pgaio_choose_idle_worker()) < 0)
+ break;
+ latches[nlatches++] = io_worker_control->workers[worker].latch;
+ }
+ }
+ LWLockRelease(AioWorkerSubmissionQueueLock);
+
+ for (int i = 0; i < nlatches; ++i)
+ SetLatch(latches[i]);
+
+ if (io_index != UINT32_MAX)
+ {
+ ioh = &pgaio_ctl->io_handles[io_index];
+
+ pgaio_debug_io(DEBUG4, unvolatize(PgAioHandle *, ioh),
+ "worker %d processing IO",
+ MyIoWorkerId);
+
+ pgaio_io_reopen(unvolatize(PgAioHandle *, ioh));
+ pgaio_io_perform_synchronously(unvolatize(PgAioHandle *, ioh));
+
+ ioh = NULL;
+ }
+ else
+ {
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_IO_WORKER_MAIN);
+ ResetLatch(MyLatch);
+ CHECK_FOR_INTERRUPTS();
+ }
}
proc_exit(0);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e702aa7152a..05751417482 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -350,6 +350,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e68e112c72f..5005e65cee0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -847,7 +847,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
# WIP AIO GUC docs
#------------------------------------------------------------------------------
-#io_method = sync # (change requires restart)
+#io_method = worker # (change requires restart)
#io_workers = 3 # 1-32;
#io_max_concurrency = 32 # Max number of IOs that may be in
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7f34559b1b..1e7bbeff1b6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -55,6 +55,9 @@ AggStrategy
AggTransInfo
Aggref
AggregateInstrumentation
+AioWorkerControl
+AioWorkerSlot
+AioWorkerSubmissionQueue
AlenState
Alias
AllocBlock
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0014-aio-Add-liburing-dependency.patch (text/x-diff; charset=us-ascii)
From cbd5bc8e99f0d80fa37f5065c893751f238c26da Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 5 Jun 2024 19:37:25 -0700
Subject: [PATCH v2.3 14/30] aio: Add liburing dependency
Not yet used.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
meson.build | 14 ++++
meson_options.txt | 3 +
configure.ac | 11 +++
src/makefiles/meson.build | 3 +
src/include/pg_config.h.in | 3 +
src/backend/Makefile | 7 +-
configure | 138 +++++++++++++++++++++++++++++++++++++
.cirrus.tasks.yml | 1 +
src/Makefile.global.in | 4 ++
9 files changed, 181 insertions(+), 3 deletions(-)
diff --git a/meson.build b/meson.build
index 32fc89f3a4b..2bca586e5f3 100644
--- a/meson.build
+++ b/meson.build
@@ -854,6 +854,18 @@ endif
+###############################################################
+# Library: liburing
+###############################################################
+
+liburingopt = get_option('liburing')
+liburing = dependency('liburing', required: liburingopt)
+if liburing.found()
+ cdata.set('USE_LIBURING', 1)
+endif
+
+
+
###############################################################
# Library: libxml
###############################################################
@@ -3058,6 +3070,7 @@ backend_both_deps += [
icu_i18n,
ldap,
libintl,
+ liburing,
libxml,
lz4,
pam,
@@ -3702,6 +3715,7 @@ if meson.version().version_compare('>=0.57')
'gss': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/meson_options.txt b/meson_options.txt
index d9c7ddccbc4..abe8600ec35 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -103,6 +103,9 @@ option('ldap', type: 'feature', value: 'auto',
option('libedit_preferred', type: 'boolean', value: false,
description: 'Prefer BSD Libedit over GNU Readline')
+option('liburing', type : 'feature', value: 'auto',
+ description: 'Use liburing for async io')
+
option('libxml', type: 'feature', value: 'auto',
description: 'XML support')
diff --git a/configure.ac b/configure.ac
index d713360f340..00d6c366ecd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -975,6 +975,14 @@ AC_SUBST(with_readline)
PGAC_ARG_BOOL(with, libedit-preferred, no,
[prefer BSD Libedit over GNU Readline])
+#
+# liburing
+#
+AC_MSG_CHECKING([whether to build with liburing support])
+PGAC_ARG_BOOL(with, liburing, no, [use liburing for async io],
+ [AC_DEFINE([USE_LIBURING], 1, [Define to build with io-uring support. (--with-liburing)])])
+AC_MSG_RESULT([$with_liburing])
+AC_SUBST(with_liburing)
#
# UUID library
@@ -1427,6 +1435,9 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+if test "$with_liburing" = yes; then
+ PKG_CHECK_MODULES(LIBURING, liburing)
+fi
##
## Header files
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index d49b2079a44..714b7ccaa4e 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -199,6 +199,8 @@ pgxs_empty = [
'PTHREAD_CFLAGS', 'PTHREAD_LIBS',
'ICU_LIBS',
+
+ 'LIBURING_CFLAGS', 'LIBURING_LIBS',
]
if host_system == 'windows' and cc.get_argument_syntax() != 'msvc'
@@ -229,6 +231,7 @@ pgxs_deps = {
'gssapi': gssapi,
'icu': icu,
'ldap': ldap,
+ 'liburing': liburing,
'libxml': libxml,
'libxslt': libxslt,
'llvm': llvm,
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..6ab71a3dffe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -663,6 +663,9 @@
/* Define to 1 to build with LDAP support. (--with-ldap) */
#undef USE_LDAP
+/* Define to build with io-uring support. (--with-liburing) */
+#undef USE_LIBURING
+
/* Define to 1 to build with XML support. (--with-libxml) */
#undef USE_LIBXML
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 42d4a28e5aa..7344c8c7f5c 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
$(top_builddir)/src/common/libpgcommon_srv.a \
$(top_builddir)/src/port/libpgport_srv.a
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
# The backend doesn't need everything that's in LIBS, however
LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
diff --git a/configure b/configure
index ceeef9b0915..e477baedfb6 100755
--- a/configure
+++ b/configure
@@ -651,6 +651,8 @@ LIBOBJS
OPENSSL
ZSTD
LZ4
+LIBURING_LIBS
+LIBURING_CFLAGS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
@@ -709,6 +711,7 @@ XML2_CFLAGS
XML2_CONFIG
with_libxml
with_uuid
+with_liburing
with_readline
with_systemd
with_selinux
@@ -862,6 +865,7 @@ with_selinux
with_systemd
with_readline
with_libedit_preferred
+with_liburing
with_uuid
with_ossp_uuid
with_libxml
@@ -905,6 +909,8 @@ LDFLAGS_EX
LDFLAGS_SL
PERL
PYTHON
+LIBURING_CFLAGS
+LIBURING_LIBS
MSGFMT
TCLSH'
@@ -1572,6 +1578,7 @@ Optional Packages:
--without-readline do not use GNU Readline nor BSD Libedit for editing
--with-libedit-preferred
prefer BSD Libedit over GNU Readline
+ --with-liburing use liburing for async io
--with-uuid=LIB build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
@@ -1618,6 +1625,10 @@ Some influential environment variables:
LDFLAGS_SL extra linker flags for linking shared libraries only
PERL Perl program
PYTHON Python program
+ LIBURING_CFLAGS
+ C compiler flags for LIBURING, overriding pkg-config
+ LIBURING_LIBS
+ linker flags for LIBURING, overriding pkg-config
MSGFMT msgfmt program for NLS
TCLSH Tcl interpreter program (tclsh)
@@ -8681,6 +8692,40 @@ fi
+#
+# liburing
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with liburing support" >&5
+$as_echo_n "checking whether to build with liburing support... " >&6; }
+
+
+
+# Check whether --with-liburing was given.
+if test "${with_liburing+set}" = set; then :
+ withval=$with_liburing;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBURING 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-liburing option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_liburing=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_liburing" >&5
+$as_echo "$with_liburing" >&6; }
+
#
# UUID library
@@ -13231,6 +13276,99 @@ fi
fi
+if test "$with_liburing" = yes; then
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for liburing" >&5
+$as_echo_n "checking for liburing... " >&6; }
+
+if test -n "$LIBURING_CFLAGS"; then
+ pkg_cv_LIBURING_CFLAGS="$LIBURING_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_CFLAGS=`$PKG_CONFIG --cflags "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBURING_LIBS"; then
+ pkg_cv_LIBURING_LIBS="$LIBURING_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"liburing\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "liburing") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBURING_LIBS=`$PKG_CONFIG --libs "liburing" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "liburing" 2>&1`
+ else
+ LIBURING_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "liburing" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBURING_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (liburing) were not met:
+
+$LIBURING_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBURING_CFLAGS
+and LIBURING_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ LIBURING_CFLAGS=$pkg_cv_LIBURING_CFLAGS
+ LIBURING_LIBS=$pkg_cv_LIBURING_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
##
## Header files
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 18e944ca89d..67d3d77fb10 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -334,6 +334,7 @@ task:
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--with-segsize-blocks=6 \
+ --with-liburing \
\
${LINUX_CONFIGURE_FEATURES} \
\
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 1278b7744f4..8ad259a54cd 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -190,6 +190,7 @@ with_systemd = @with_systemd@
with_gssapi = @with_gssapi@
with_krb_srvnam = @with_krb_srvnam@
with_ldap = @with_ldap@
+with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
@@ -216,6 +217,9 @@ krb_srvtab = @krb_srvtab@
ICU_CFLAGS = @ICU_CFLAGS@
ICU_LIBS = @ICU_LIBS@
+LIBURING_CFLAGS = @LIBURING_CFLAGS@
+LIBURING_LIBS = @LIBURING_LIBS@
+
TCLSH = @TCLSH@
TCL_LIBS = @TCL_LIBS@
TCL_LIB_SPEC = @TCL_LIB_SPEC@
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0015-aio-Add-io_uring-method.patch (text/x-diff; charset=us-ascii)
From 8729492fc8eb698851442f0165cb12948c4db4f4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:36 -0500
Subject: [PATCH v2.3 15/30] aio: Add io_uring method
---
src/include/storage/aio.h | 3 +
src/include/storage/aio_internal.h | 3 +
src/include/storage/lwlock.h | 1 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio.c | 6 +
src/backend/storage/aio/meson.build | 1 +
src/backend/storage/aio/method_io_uring.c | 382 ++++++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 1 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 399 insertions(+)
create mode 100644 src/backend/storage/aio/method_io_uring.c
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 39d7e4cff55..8c1b9a1b496 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -24,6 +24,9 @@ typedef enum IoMethod
{
IOMETHOD_SYNC = 0,
IOMETHOD_WORKER,
+#ifdef USE_LIBURING
+ IOMETHOD_IO_URING,
+#endif
} IoMethod;
/* We'll default to worker based execution. */
diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h
index 86d8d099c91..eff544ce621 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -286,6 +286,9 @@ extern const char *pgaio_io_get_target_name(PgAioHandle *ioh);
/* Declarations for the tables of function pointers exposed by each IO method. */
extern PGDLLIMPORT const IoMethodOps pgaio_sync_ops;
extern PGDLLIMPORT const IoMethodOps pgaio_worker_ops;
+#ifdef USE_LIBURING
+extern PGDLLIMPORT const IoMethodOps pgaio_uring_ops;
+#endif
extern PGDLLIMPORT const IoMethodOps *pgaio_method_ops;
extern PGDLLIMPORT PgAioCtl *pgaio_ctl;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 13a7dc89980..043e8bae7a9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_AIO_URING_COMPLETION,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index f51c34a37f8..c06c50771e0 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -14,6 +14,7 @@ OBJS = \
aio_init.o \
aio_io.o \
aio_target.o \
+ method_io_uring.o \
method_sync.o \
method_worker.o \
read_stream.o
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 6c264b61ca5..c1dd073e37f 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -58,6 +58,9 @@ static void pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation);
const struct config_enum_entry io_method_options[] = {
{"sync", IOMETHOD_SYNC, false},
{"worker", IOMETHOD_WORKER, false},
+#ifdef USE_LIBURING
+ {"io_uring", IOMETHOD_IO_URING, false},
+#endif
{NULL, 0, false}
};
@@ -75,6 +78,9 @@ PgAioBackend *pgaio_my_backend;
static const IoMethodOps *const pgaio_method_ops_table[] = {
[IOMETHOD_SYNC] = &pgaio_sync_ops,
[IOMETHOD_WORKER] = &pgaio_worker_ops,
+#ifdef USE_LIBURING
+ [IOMETHOD_IO_URING] = &pgaio_uring_ops,
+#endif
};
/* callbacks for the configured io_method, set by assign_io_method */
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 74f94c6e40b..2f0f03d8071 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -6,6 +6,7 @@ backend_sources += files(
'aio_init.c',
'aio_io.c',
'aio_target.c',
+ 'method_io_uring.c',
'method_sync.c',
'method_worker.c',
'read_stream.c',
diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c
new file mode 100644
index 00000000000..da92795fce7
--- /dev/null
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -0,0 +1,382 @@
+/*-------------------------------------------------------------------------
+ *
+ * method_io_uring.c
+ * AIO - perform AIO using Linux' io_uring
+ *
+ * XXX Write me
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/method_io_uring.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#ifdef USE_LIBURING
+
+#include <liburing.h>
+
+#include "pgstat.h"
+#include "port/pg_iovec.h"
+#include "storage/aio_internal.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+
+/* Entry points for IoMethodOps. */
+static size_t pgaio_uring_shmem_size(void);
+static void pgaio_uring_shmem_init(bool first_time);
+static void pgaio_uring_init_backend(void);
+
+static int pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios);
+static void pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation);
+
+static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe);
+
+
+const IoMethodOps pgaio_uring_ops = {
+ .shmem_size = pgaio_uring_shmem_size,
+ .shmem_init = pgaio_uring_shmem_init,
+ .init_backend = pgaio_uring_init_backend,
+
+ .submit = pgaio_uring_submit,
+ .wait_one = pgaio_uring_wait_one,
+};
+
+typedef struct PgAioUringContext
+{
+ LWLock completion_lock;
+
+ struct io_uring io_uring_ring;
+ /* XXX: probably worth padding to a cacheline boundary here */
+} PgAioUringContext;
+
+
+static PgAioUringContext *pgaio_uring_contexts;
+static PgAioUringContext *pgaio_my_uring_context;
+
+/* io_uring local state */
+static struct io_uring local_ring;
+
+
+
+static Size
+pgaio_uring_context_shmem_size(void)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+
+ return mul_size(TotalProcs, sizeof(PgAioUringContext));
+}
+
+static size_t
+pgaio_uring_shmem_size(void)
+{
+ return pgaio_uring_context_shmem_size();
+}
+
+static void
+pgaio_uring_shmem_init(bool first_time)
+{
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+ bool found;
+
+ pgaio_uring_contexts = (PgAioUringContext *)
+ ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+ if (found)
+ return;
+
+ for (int contextno = 0; contextno < TotalProcs; contextno++)
+ {
+ PgAioUringContext *context = &pgaio_uring_contexts[contextno];
+ int ret;
+
+ /*
+ * XXX: Probably worth sharing the WQ between the different rings,
+ * when supported by the kernel. Could also cause additional
+ * contention, I guess?
+ */
+#if 0
+ if (!AcquireExternalFD())
+ elog(ERROR, "No external FD available");
+#endif
+ ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+
+ LWLockInitialize(&context->completion_lock, LWTRANCHE_AIO_URING_COMPLETION);
+ }
+}
+
+static void
+pgaio_uring_init_backend(void)
+{
+ int ret;
+
+ pgaio_my_uring_context = &pgaio_uring_contexts[MyProcNumber];
+
+ ret = io_uring_queue_init(32, &local_ring, 0);
+ if (ret < 0)
+ elog(ERROR, "io_uring_queue_init failed: %s", strerror(-ret));
+}
+
+static int
+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+ struct io_uring *uring_instance = &pgaio_my_uring_context->io_uring_ring;
+ int in_flight_before = dclist_count(&pgaio_my_backend->in_flight_ios);
+
+ Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+ for (int i = 0; i < num_staged_ios; i++)
+ {
+ PgAioHandle *ioh = staged_ios[i];
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(uring_instance);
+
+ if (!sqe)
+ elog(ERROR, "io_uring submission queue is unexpectedly full");
+
+ pgaio_io_prepare_submit(ioh);
+ pgaio_uring_sq_from_io(ioh, sqe);
+
+ /*
+ * io_uring executes IO in process context if possible. That's
+ * generally good, as it reduces context switching. When performing a
+ * lot of buffered IO that means that copying between page cache and
+ * userspace memory happens in the foreground, as it can't be
+ * offloaded to DMA hardware as is possible when using direct IO. When
+ * executing a lot of buffered IO this causes io_uring to be slower
+ * than worker mode, as worker mode parallelizes the copying. io_uring
+ * can be told to offload work to worker threads instead.
+ *
+ * If an IO is buffered IO and we already have IOs in flight or
+ * multiple IOs are being submitted, we thus tell io_uring to execute
+ * the IO in the background. We don't do so for the first few IOs
+ * being submitted as executing in this process' context has lower
+ * latency.
+ */
+ if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
+ io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
+
+ in_flight_before++;
+ }
+
+ while (true)
+ {
+ int ret;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+ ret = io_uring_submit(uring_instance);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ pgaio_debug(DEBUG3,
+ "aio method uring: submit EINTR, nios: %d",
+ num_staged_ios);
+ continue;
+ }
+ if (ret < 0)
+			elog(PANIC, "io_uring submit failed: %d/%s",
+				 ret, strerror(-ret));
+ else if (ret != num_staged_ios)
+ {
+ /* likely unreachable, but if it is, we would need to re-submit */
+			elog(PANIC, "io_uring submitted only %d of %d IOs",
+				 ret, num_staged_ios);
+ }
+ else
+ {
+ pgaio_debug(DEBUG4,
+ "aio method uring: submitted %d IOs",
+ num_staged_ios);
+ }
+ break;
+ }
+
+ return num_staged_ios;
+}
+
+
+#define PGAIO_MAX_LOCAL_COMPLETED_IO 32
+
+static void
+pgaio_uring_drain_locked(PgAioUringContext *context)
+{
+ int ready;
+ int orig_ready;
+
+ /*
+ * Don't drain more events than available right now. Otherwise it's
+ * plausible that one backend could get stuck, for a while, receiving CQEs
+ * without actually processing them.
+ */
+ orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
+
+ while (ready > 0)
+ {
+ struct io_uring_cqe *cqes[PGAIO_MAX_LOCAL_COMPLETED_IO];
+ uint32 ncqes;
+
+ START_CRIT_SECTION();
+ ncqes =
+ io_uring_peek_batch_cqe(&context->io_uring_ring,
+ cqes,
+ Min(PGAIO_MAX_LOCAL_COMPLETED_IO, ready));
+ Assert(ncqes <= ready);
+
+ ready -= ncqes;
+
+ for (int i = 0; i < ncqes; i++)
+ {
+ struct io_uring_cqe *cqe = cqes[i];
+ PgAioHandle *ioh;
+
+ ioh = io_uring_cqe_get_data(cqe);
+ io_uring_cqe_seen(&context->io_uring_ring, cqe);
+
+ pgaio_io_process_completion(ioh, cqe->res);
+ }
+
+ END_CRIT_SECTION();
+
+ pgaio_debug(DEBUG3,
+ "drained %d/%d, now expecting %d",
+ ncqes, orig_ready, io_uring_cq_ready(&context->io_uring_ring));
+ }
+}
+
+static void
+pgaio_uring_wait_one(PgAioHandle *ioh, uint64 ref_generation)
+{
+ PgAioHandleState state;
+ ProcNumber owner_procno = ioh->owner_procno;
+ PgAioUringContext *owner_context = &pgaio_uring_contexts[owner_procno];
+ bool expect_cqe;
+ int waited = 0;
+
+ /*
+ * We ought to have a smarter locking scheme, nearly all the time the
+ * backend owning the ring will consume the completions, making the
+ * locking unnecessarily expensive.
+ */
+ LWLockAcquire(&owner_context->completion_lock, LW_EXCLUSIVE);
+
+ while (true)
+ {
+ pgaio_debug_io(DEBUG3, ioh,
+ "wait_one io_gen: %llu, ref_gen: %llu, cycle %d",
+					   (long long unsigned) ioh->generation,
+					   (long long unsigned) ref_generation,
+ waited);
+
+ if (pgaio_io_was_recycled(ioh, ref_generation, &state) ||
+ state != PGAIO_HS_SUBMITTED)
+ {
+ break;
+ }
+ else if (io_uring_cq_ready(&owner_context->io_uring_ring))
+ {
+ expect_cqe = true;
+ }
+ else
+ {
+ int ret;
+ struct io_uring_cqe *cqes;
+
+ pgstat_report_wait_start(WAIT_EVENT_AIO_DRAIN);
+ ret = io_uring_wait_cqes(&owner_context->io_uring_ring, &cqes, 1, NULL, NULL);
+ pgstat_report_wait_end();
+
+ if (ret == -EINTR)
+ {
+ continue;
+ }
+ else if (ret != 0)
+ {
+				elog(PANIC, "io_uring wait_cqes failed: %d/%s",
+					 ret, strerror(-ret));
+ }
+ else
+ {
+ Assert(cqes != NULL);
+ expect_cqe = true;
+ waited++;
+ }
+ }
+
+ if (expect_cqe)
+ {
+ pgaio_uring_drain_locked(owner_context);
+ }
+ }
+
+ LWLockRelease(&owner_context->completion_lock);
+
+ pgaio_debug(DEBUG3,
+ "wait_one with %d sleeps",
+ waited);
+}
+
+static void
+pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
+{
+ struct iovec *iov;
+
+ switch (ioh->op)
+ {
+ case PGAIO_OP_READV:
+ iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.read.iov_length == 1)
+ {
+ io_uring_prep_read(sqe,
+ ioh->op_data.read.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.read.offset);
+ }
+ else
+ {
+ io_uring_prep_readv(sqe,
+ ioh->op_data.read.fd,
+ iov,
+ ioh->op_data.read.iov_length,
+ ioh->op_data.read.offset);
+
+ }
+ break;
+
+ case PGAIO_OP_WRITEV:
+ iov = &pgaio_ctl->iovecs[ioh->iovec_off];
+ if (ioh->op_data.write.iov_length == 1)
+ {
+ io_uring_prep_write(sqe,
+ ioh->op_data.write.fd,
+ iov->iov_base,
+ iov->iov_len,
+ ioh->op_data.write.offset);
+ }
+ else
+ {
+ io_uring_prep_writev(sqe,
+ ioh->op_data.write.fd,
+ iov,
+ ioh->op_data.write.iov_length,
+ ioh->op_data.write.offset);
+ }
+ break;
+
+ case PGAIO_OP_INVALID:
+ elog(ERROR, "trying to prepare invalid IO operation for execution");
+ }
+
+ io_uring_sqe_set_data(sqe, ioh);
+}
+
+#endif /* USE_LIBURING */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c3d6f886e3c..dbc169c8541 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_AIO_URING_COMPLETION] = "AioUringCompletion",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1e7bbeff1b6..be2dd22f1d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2128,6 +2128,7 @@ PgAioReturn
PgAioTargetData
PgAioTargetID
PgAioTargetInfo
+PgAioUringContext
PgAioWaitRef
PgArchData
PgBackendGSSStatus
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0016-aio-Add-README.md-explaining-higher-level-desig.patch
From 0a201985c794113e4cf062e8f5037fb7ab03c1ea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:37 -0500
Subject: [PATCH v2.3 16/30] aio: Add README.md explaining higher level design
---
src/backend/storage/aio/README.md | 430 ++++++++++++++++++++++++++++++
src/backend/storage/aio/aio.c | 2 +
2 files changed, 432 insertions(+)
create mode 100644 src/backend/storage/aio/README.md
diff --git a/src/backend/storage/aio/README.md b/src/backend/storage/aio/README.md
new file mode 100644
index 00000000000..1b6f9d2c40b
--- /dev/null
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,430 @@
+# Asynchronous & Direct IO
+
+## Motivation
+
+### Why Asynchronous IO
+
+Until the introduction of asynchronous IO, Postgres relied on the operating
+system to hide the cost of synchronous IO. While this worked surprisingly
+well in a lot of workloads, it does not do as good a job of prefetching and
+controlled writeback as we would like.
+
+There are important expensive operations like `fdatasync()` where the operating
+system cannot hide the storage latency. This is particularly important for WAL
+writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
+writes can yield significantly higher throughput.
+
+
+### Why Direct / unbuffered IO
+
+The main reasons to use Direct IO are:
+
+- Lower CPU usage / higher throughput. Particularly on modern storage,
+  buffered writes are bottlenecked by the operating system having to copy data
+  from the kernel's page cache to the postgres buffer pool using the CPU,
+  whereas direct IO can often move the data directly between the storage
+  device and postgres' buffer pool using DMA. While that transfer is ongoing,
+  the CPU is free to perform other work.
+- Reduced latency - Direct IO can have substantially lower latency than
+ buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
+ write latency.
+- Avoiding double buffering between operating system cache and postgres'
+ shared_buffers.
+- Better control over the timing and pace of dirty data writeback.
+
+
+The main reasons *not* to use Direct IO are:
+
+- Without AIO, Direct IO is unusably slow for most purposes.
+- Even with AIO, many parts of postgres need to be modified to perform
+ explicit prefetching.
+- In situations where shared_buffers cannot be set appropriately large,
+ e.g. because there are many different postgres instances hosted on shared
+ hardware, performance will often be worse than when using buffered IO.
+
+
+## AIO Usage Example
+
+In many cases code that can benefit from AIO does not directly have to
+interact with the AIO interface, but can use AIO via higher-level
+abstractions. See [Helpers](#helpers).
+
+In this example, a buffer will be read into shared buffers.
+
+```C
+/*
+ * Result of the operation, only to be accessed in this backend.
+ */
+PgAioReturn ioret;
+
+/*
+ * Acquire an AIO Handle, ioret will get result upon completion.
+ *
+ * Note that ioret needs to stay alive until the IO completes or
+ * CurrentResourceOwner is released (i.e. an error is thrown).
+ */
+PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);
+
+/*
+ * Reference that can be used to wait for the IO we initiate below. This
+ * reference can reside in local or shared memory and waited upon by any
+ * process. An arbitrary number of references can be made for each IO.
+ */
+PgAioWaitRef iow;
+
+pgaio_io_get_wref(ioh, &iow);
+
+/*
+ * Arrange for shared buffer completion callbacks to be called upon completion
+ * of the IO. This callback will update the buffer descriptors associated with
+ * the AioHandle, which e.g. allows other backends to access the buffer.
+ *
+ * Multiple completion callbacks can be registered for each handle.
+ */
+pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
+
+/*
+ * The completion callback needs to know which buffers to update when the IO
+ * completes. As the AIO subsystem does not know about buffers, we have to
+ * associate this information with the AioHandle, for use by the completion
+ * callback registered above.
+ *
+ * In this example we're reading only a single buffer, hence the 1.
+ */
+pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1);
+
+/*
+ * Pass the AIO handle to lower-level function. When operating on the level of
+ * buffers, we don't know how exactly the IO is performed, that is the
+ * responsibility of the storage manager implementation.
+ *
+ * E.g. md.c needs to translate block numbers into offsets in segments.
+ *
+ * Once the IO handle has been handed off to smgrstartreadv(), it may not
+ * be used any further, as the IO may immediately get executed below
+ * smgrstartreadv() and the handle reused for another IO.
+ */
+smgrstartreadv(ioh, operation->smgr, forknum, blkno,
+ BufferGetBlock(buffer), 1);
+
+/*
+ * As mentioned above, the IO might be initiated within smgrstartreadv(). That
+ * is however not guaranteed, to allow IO submission to be batched.
+ *
+ * Note that one needs to be careful while there may be unsubmitted IOs, as
+ * another backend may need to wait for one of the unsubmitted IOs. If this
+ * backend were to wait for the other backend, we'd have a deadlock. To avoid
+ * that, pending IOs need to be explicitly submitted before this backend
+ * might be blocked by a backend waiting for IO.
+ *
+ * Note that the IO might have immediately been submitted (e.g. due to reaching
+ * a limit on the number of unsubmitted IOs) and even completed during the
+ * smgrstartreadv() above.
+ *
+ * Once submitted, the IO is in-flight and can complete at any time.
+ *
+ * TODO: rename to kick as suggested by Heikki?
+ */
+pgaio_submit_staged();
+
+/*
+ * To benefit from AIO, it is beneficial to perform other work, including
+ * submitting other IOs, before waiting for the IO to complete. Otherwise
+ * we could just have used synchronous, blocking IO.
+ */
+perform_other_work();
+
+/*
+ * We did some other work and now need the IO operation to have completed to
+ * continue.
+ */
+pgaio_wref_wait(&iow);
+
+/*
+ * At this point the IO has completed. We do not yet know whether it succeeded
+ * or failed, however. The buffer's state has been updated, which allows other
+ * backends to use the buffer (if the IO succeeded), or retry the IO (if it
+ * failed).
+ *
+ * Note that in case the IO has failed, a LOG message may have been emitted,
+ * but no ERROR has been raised. This is crucial, as another backend waiting
+ * for this IO should not see an ERROR.
+ *
+ * To check whether the operation succeeded, and to raise an ERROR (or, where
+ * more appropriate, a LOG message), the PgAioReturn we passed to
+ * pgaio_io_acquire() is used.
+ */
+if (ioret.result.status == ARS_ERROR)
+    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
+
+/*
+ * Besides having succeeded completely, the IO could also have partially
+ * completed. If we e.g. tried to read many blocks at once, the read might have
+ * only succeeded for the first few blocks.
+ *
+ * If the IO partially succeeded and this backend needs all blocks to have
+ * completed, this backend needs to reissue the IO for the remaining buffers.
+ * The AIO subsystem cannot handle this retry transparently.
+ *
+ * As this example is already long, and we only read a single block, we'll just
+ * error out if there's a partial read.
+ */
+if (ioret.result.status == ARS_PARTIAL)
+    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
+
+/*
+ * The IO succeeded, so we can use the buffer now.
+ */
+```
+
+
+## Design Criteria & Motivation
+
+### Deadlock and Starvation Dangers due to AIO
+
+Using AIO in a naive way can easily lead to deadlocks in an environment where
+the source/target of AIO are shared resources, like pages in postgres'
+shared_buffers.
+
+Consider one backend performing readahead on a table, initiating IO for a
+number of buffers ahead of the current "scan position". If that backend then
+performs some operation that blocks, or even just is slow, the IO completion
+for the asynchronously initiated read may not be processed.
+
+This AIO implementation solves this problem by requiring that AIO methods
+either allow AIO completions to be processed by any backend in the system
+(e.g. io_uring), or guarantee that AIO processing will happen even when the
+issuing backend is blocked (e.g. worker mode, which offloads completion
+processing to the AIO workers).
+
+
+### IO can be started in critical sections
+
+Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
+
+- AIO allows WAL writes to be started eagerly, so they complete before the
+  backend needs to wait
+- AIO allows multiple WAL flushes to be in progress at the same time
+- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
+  the number of roundtrips to storage on some OSs and storage HW (buffered IO
+  and direct IO without O_DSYNC need to issue a write and, after the write's
+  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a
+  single FUA write).
+
+The need to be able to execute IO in critical sections has substantial design
+implication on the AIO subsystem. Mainly because completing IOs (see prior
+section) needs to be possible within a critical section, even if the
+to-be-completed IO itself was not issued in a critical section. Consider
+e.g. the case of a backend first starting a number of writes from shared
+buffers and then starting to flush the WAL. Because only a limited amount of
+IO can be in-progress at the same time, initiating IO for flushing the WAL may
+require to first complete IO that was started earlier.
+
+
+### State for AIO needs to live in shared memory
+
+Because postgres uses a process model and because AIOs need to be
+complete-able by any backend, much of the state of the AIO subsystem needs to
+live in shared memory.
+
+In an `EXEC_BACKEND` build, a backend's executable code and other process
+local state are not necessarily mapped to the same addresses in each process
+due to ASLR. This means that shared memory cannot contain pointers to
+callbacks.
+
+
+## Design of the AIO Subsystem
+
+
+### AIO Methods
+
+To achieve portability and performance, multiple methods of performing AIO are
+implemented and others are likely worth adding in the future.
+
+
+#### Synchronous Mode
+
+`io_method=sync` does not actually perform AIO, but allows the AIO API to be
+used while performing synchronous IO. This can be useful for debugging. The
+code for the synchronous mode is also used as a fallback, e.g. by
+[worker mode](#worker) to execute IO that cannot be executed by workers.
+
+
+#### Worker
+
+`io_method=worker` is available on every platform postgres runs on, and
+implements asynchronous IO - from the view of the issuing process - by
+dispatching the IO to one of several worker processes performing the IO in a
+synchronous manner.
+
+
+#### io_uring
+
+`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
+dispatches all IO from within the process, lowering context switch rate /
+latency.
+
+
+### AIO Handles
+
+The central API piece of postgres' AIO abstraction is the AIO handle. To
+execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`)
+and then "define" it, i.e. associate an IO operation with the handle.
+
+Often AIO handles are acquired on a higher level and then passed to a lower
+level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
+routines acquire the handle, which is then passed through smgr.c, md.c to be
+finally fully defined in fd.c.
+
+The functions used at the lowest level to define the operation are
+`pgaio_io_prep_*()`.
+
+Because acquisition of an IO handle
+[must always succeed](#io-can-be-started-in-critical-sections)
+and the number of AIO Handles
+[has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
+AIO handles can be reused as soon as they have completed. Obviously code needs
+to be able to react to IO completion. Shared state can be updated using
+[AIO Completion callbacks](#aio-callbacks)
+and the issuing backend can provide a backend local variable to receive the
+result of the IO, as described in
+[AIO Results](#aio-results).
+An IO can be waited for, by both the issuing and any other backend, using
+[AIO Wait References](#aio-wait-references).
+
+
+Because an AIO Handle is not executable immediately after calling
+`pgaio_io_acquire()`, and because `pgaio_io_acquire()` needs to be able to
+succeed, each backend may acquire only a single AIO Handle (i.e. have it
+returned by `pgaio_io_acquire()`) without having caused the IO to be defined
+(by, potentially indirectly, causing `pgaio_io_prep_*()` to be called).
+Otherwise a backend could trivially self-deadlock by using up all AIO Handles
+without the ability to wait for some of the IOs to complete.
+
+If it turns out that an AIO Handle is not needed, e.g., because the handle was
+acquired before holding a contended lock, it can be released without being
+defined using `pgaio_io_release()`.
+
+
+### AIO Callbacks
+
+Commonly several layers need to react to completion of an IO. E.g. for a read,
+md.c needs to check if the IO outright failed or was shorter than needed, and
+bufmgr.c needs to verify that the page looks valid and update the BufferDesc
+to reflect the buffer's new state.
+
+The fact that several layers / subsystems need to react to IO completion poses
+a few challenges:
+
+- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
+ should not assume the IO will pass through md.c. Therefore upper levels
+ cannot know what lower layers would consider an error.
+
+- Lower layers should not need to know about upper layers. E.g. smgr APIs are
+ used going through shared buffers but are also used bypassing shared
+ buffers. This means that e.g. md.c is not in a position to validate
+ checksums.
+
+- Having code in the AIO subsystem for every possible combination of layers
+ would lead to a lot of duplication.
+
+The "solution" to this is the ability to associate multiple completion
+callbacks with a handle. E.g. bufmgr.c can have a callback to update the
+BufferDesc state and to verify the page, and md.c another callback to check
+whether the IO operation was successful.
+
+As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
+currently cannot contain function pointers. Because of that, completion
+callbacks are not directly identified by function pointers but by IDs
+(`PgAioHandleCallbackID`). A substantial added benefit is that this allows a
+callback to be identified by a much smaller amount of memory (currently a
+single byte).
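A minimal sketch of this ID-based dispatch (all names are invented for illustration; the real callback registry lives in the AIO subsystem, not here):

```c
#include <assert.h>
#include <stdint.h>

/* IDs are stable across processes; function addresses are not (ASLR). */
typedef enum DemoCallbackID
{
	DEMO_CB_INVALID = 0,
	DEMO_CB_DOUBLE,
	DEMO_CB_NEGATE,
} DemoCallbackID;

typedef int (*DemoCallback) (int arg);

static int cb_double(int arg) { return 2 * arg; }
static int cb_negate(int arg) { return -arg; }

/* Process-local table; each process has its own copy at its own addresses. */
static const DemoCallback demo_callbacks[] = {
	[DEMO_CB_DOUBLE] = cb_double,
	[DEMO_CB_NEGATE] = cb_negate,
};

/* What a shared-memory handle would store: just the one-byte ID. */
typedef struct DemoHandle
{
	uint8_t		callback_id;
} DemoHandle;

/* Any process can dispatch by looking the ID up in its local table. */
static int
demo_dispatch(const DemoHandle *hnd, int arg)
{
	return demo_callbacks[hnd->callback_id] (arg);
}
```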
+
+In addition to completion, AIO callbacks also are called to "prepare" an
+IO. This is, e.g., used to increase buffer reference counts to account for the
+AIO subsystem referencing the buffer, which is required to handle the case
+where the issuing backend errors out and releases its own pins while the IO is
+still ongoing.
+
+As [explained earlier](#io-can-be-started-in-critical-sections) IO completions
+need to be safe to execute in critical sections. To allow the backend that
+issued the IO to error out in case of failure [AIO Result](#aio-results) can
+be used.
+
+
+### AIO Targets
+
+In addition to the completion callbacks described above, each AIO Handle has
+exactly one "target". Each target has some space inside an AIO Handle with
+information specific to the target and can provide callbacks to allow to
+reopen the underlying file (required for worker mode) and to describe the IO
+operation (used for debug logging and error messages).
+
+I.e., if two different uses of AIO can describe the identity of the file being
+operated on the same way, it likely makes sense to use the same
+target. E.g. different smgr implementations can describe IO with
+RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In
+contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
+and it would not make sense to use the same target for smgr and WAL.
+
+
+### AIO Wait References
+
+As [described above](#aio-handles), AIO Handles can be reused immediately
+after completion and therefore cannot be used to wait for completion of the
+IO. Waiting is enabled using AIO wait references, which do not just identify
+an AIO Handle but also include the handle's "generation".
+
+A reference to an AIO Handle can be acquired using `pgaio_io_get_wref()` and
+then waited upon using `pgaio_wref_wait()`.
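The generation scheme can be illustrated with a toy model (invented names; the real handles and generations live in shared memory and involve more state):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoHandle
{
	uint64_t	generation;		/* bumped every time the handle is recycled */
	bool		in_use;
} DemoHandle;

typedef struct DemoWaitRef
{
	DemoHandle *handle;
	uint64_t	generation;		/* generation at the time the ref was taken */
} DemoWaitRef;

static void
demo_get_wref(DemoHandle *hnd, DemoWaitRef *ref)
{
	ref->handle = hnd;
	ref->generation = hnd->generation;
}

/*
 * If the generations no longer match, the IO the reference was taken for has
 * completed and the handle has since been recycled for another IO - so there
 * is nothing left to wait for.
 */
static bool
demo_wref_still_pending(const DemoWaitRef *ref)
{
	return ref->handle->in_use &&
		ref->handle->generation == ref->generation;
}

/* Completion recycles the handle by bumping its generation. */
static void
demo_complete_and_recycle(DemoHandle *hnd)
{
	hnd->generation++;
	hnd->in_use = false;
}
```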
+
+
+### AIO Results
+
+As AIO completion callbacks
+[are executed in critical sections](#io-can-be-started-in-critical-sections)
+and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
+completion callbacks cannot be used to, e.g., make the query that triggered an
+IO ERROR out.
+
+To allow reacting to failing IOs, the issuing backend can pass a pointer to a
+`PgAioReturn` in backend local memory. Before an AIO Handle is reused, the
+`PgAioReturn` is filled with information about the IO. This includes
+information about whether the IO was successful (as a value of
+`PgAioResultStatus`) and enough information to raise an error in case of a
+failure (via `pgaio_result_report()`, with the error details encoded in
+`PgAioResult`).
+
+XXX: "return" vs "result" vs "result status" seems quite confusing. The naming
+should be improved.
+
+
+### AIO Errors
+
+It would be very convenient to have shared completion callbacks encode the
+details of errors as an `ErrorData` that could be raised at a later
+time. Unfortunately doing so would require allocating memory. While elog.c can
+guarantee (well, kinda) that logging a message will not run out of memory,
+that only works because a very limited number of messages are in the process
+of being logged. With AIO a large number of concurrently issued AIOs might
+fail.
+
+To avoid the need for preallocating a potentially large amount of memory (in
+shared memory no less!), completion callbacks instead have to encode errors in
+a more compact format that can be converted into an error message.
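As one possible illustration (a sketch, not the actual PgAioResult layout), an error can be packed into a single 32-bit value and only expanded into a message by the backend that eventually raises it:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum DemoStatus
{
	DEMO_OK = 0,
	DEMO_PARTIAL,
	DEMO_ERROR,
} DemoStatus;

/* status in the top byte, errno-style detail in the low 24 bits */
static uint32_t
demo_encode(DemoStatus status, int error_code)
{
	return ((uint32_t) status << 24) | ((uint32_t) error_code & 0xFFFFFF);
}

static DemoStatus
demo_status(uint32_t packed)
{
	return (DemoStatus) (packed >> 24);
}

static int
demo_errcode(uint32_t packed)
{
	return (int) (packed & 0xFFFFFF);
}

/* Message text is only produced when the error is actually raised. */
static void
demo_format(uint32_t packed, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "IO failed: %s", strerror(demo_errcode(packed)));
}
```

The point of the deferral is that no per-IO message memory has to be preallocated, no matter how many concurrently issued IOs fail.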
+
+
+## Helpers
+
+Using the low-level AIO API directly introduces too much complexity to do so
+all over the tree. Most uses of AIO should instead go through reusable,
+higher-level helpers.
+
+
+### Read Stream
+
+A common and very beneficial use of AIO is reads where a substantial number
+of to-be-read locations are known ahead of time. E.g., for a sequential scan
+the set of blocks that need to be read can be determined solely by knowing the
+current position and checking the buffer mapping table.
+
+The [Read Stream](../../../include/storage/read_stream.h) interface makes it
+comparatively easy to use AIO for such use cases.
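The underlying pattern, keeping a bounded number of reads in flight ahead of the consumer, can be sketched as a toy iterator (invented names; the real interface in read_stream.h is considerably richer):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_DISTANCE 4			/* how far ahead to start IO */

typedef struct DemoStream
{
	int			next_to_issue;	/* next block to start IO for */
	int			next_to_return; /* next block to hand to the caller */
	int			nblocks;		/* total blocks in the "relation" */
	int			issued;			/* number of IO starts, for illustration */
} DemoStream;

static void
demo_start_io(DemoStream *stream, int blockno)
{
	/* in real code: acquire an AIO handle and start a read */
	stream->issued++;
	(void) blockno;
}

/* Keep up to DEMO_DISTANCE reads in flight ahead of the consumer. */
static bool
demo_next_block(DemoStream *stream, int *blockno)
{
	while (stream->next_to_issue < stream->nblocks &&
		   stream->next_to_issue - stream->next_to_return < DEMO_DISTANCE)
		demo_start_io(stream, stream->next_to_issue++);

	if (stream->next_to_return >= stream->nblocks)
		return false;

	/* in real code: wait here for the block's IO to complete */
	*blockno = stream->next_to_return++;
	return true;
}
```

Because the issue pointer runs ahead of the return pointer, the IO for a block has typically completed by the time the consumer asks for it.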
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index c1dd073e37f..b3b4e74c3ce 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -24,6 +24,8 @@
*
* - read_stream.c - helper for reading buffered relation data
*
+ * - README.md - higher-level overview over AIO
+ *
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0017-aio-Implement-smgr-md-fd-aio-methods.patch
From 6c9493bdbc9164decc460c7ab74aaceea19d67a0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:06:51 -0500
Subject: [PATCH v2.3 17/30] aio: Implement smgr/md/fd aio methods
---
src/include/storage/aio.h | 6 +-
src/include/storage/aio_types.h | 12 +-
src/include/storage/fd.h | 6 +
src/include/storage/md.h | 12 +
src/include/storage/smgr.h | 22 ++
src/backend/storage/aio/aio_callback.c | 4 +
src/backend/storage/aio/aio_target.c | 2 +
src/backend/storage/file/fd.c | 68 +++++
src/backend/storage/smgr/md.c | 360 +++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 126 +++++++++
10 files changed, 614 insertions(+), 4 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 8c1b9a1b496..a948eaeefa7 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -108,9 +108,10 @@ typedef enum PgAioTargetID
{
/* intentionally the zero value, to help catch zeroed memory etc */
PGAIO_TID_INVALID = 0,
+ PGAIO_TID_SMGR,
} PgAioTargetID;
-#define PGAIO_TID_COUNT (PGAIO_TID_INVALID + 1)
+#define PGAIO_TID_COUNT (PGAIO_TID_SMGR + 1)
/*
@@ -174,6 +175,9 @@ typedef struct PgAioTargetInfo
typedef enum PgAioHandleCallbackID
{
PGAIO_HCB_INVALID,
+
+ PGAIO_HCB_MD_READV,
+ PGAIO_HCB_MD_WRITEV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/aio_types.h b/src/include/storage/aio_types.h
index d2617139a25..762fce3f075 100644
--- a/src/include/storage/aio_types.h
+++ b/src/include/storage/aio_types.h
@@ -58,11 +58,17 @@ typedef struct PgAioWaitRef
*/
typedef union PgAioTargetData
{
- /* just as an example placeholder for later */
struct
{
- uint32 queue_id;
- } wal;
+ RelFileLocator rlocator; /* physical relation identifier */
+ BlockNumber blockNum; /* blknum relative to begin of reln */
+ BlockNumber nblocks;
+ ForkNumber forkNum:8; /* don't waste 4 bytes for four values */
+ bool is_temp:1; /* proc can be inferred by owning AIO */
+ bool release_lock:1;
+ bool skip_fsync:1;
+ uint8 mode;
+ } smgr;
} PgAioTargetData;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..e2fd896646e 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -101,6 +101,8 @@ extern PGDLLIMPORT int max_safe_fds;
* prototypes for functions in fd.c
*/
+struct PgAioHandle;
+
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -109,6 +111,8 @@ extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern ssize_t FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartReadV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern ssize_t FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int FileStartWriteV(struct PgAioHandle *ioh, File file, int iovcnt, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 05bf537066e..7b28c3d482c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -19,6 +19,10 @@
#include "storage/smgr.h"
#include "storage/sync.h"
+struct PgAioHandleCallbacks;
+extern const struct PgAioHandleCallbacks aio_md_readv_cb;
+extern const struct PgAioHandleCallbacks aio_md_writev_cb;
+
/* md storage manager functionality */
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
@@ -36,9 +40,16 @@ extern uint32 mdmaxcombine(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void mdstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks, bool skipFsync);
+extern void mdstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync);
extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -46,6 +57,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber old_blocks, BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 4016b206ad6..86fa07b110f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -73,6 +73,11 @@ typedef SMgrRelationData *SMgrRelation;
#define SmgrIsTemp(smgr) \
RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
+struct PgAioHandle;
+struct PgAioTargetInfo;
+
+extern const struct PgAioTargetInfo aio_smgr_target_info;
+
extern void smgrinit(void);
extern SMgrRelation smgropen(RelFileLocator rlocator, ProcNumber backend);
extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
@@ -97,10 +102,19 @@ extern uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum,
extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+extern void smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+extern void smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,6 +124,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern int smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
@@ -127,4 +142,11 @@ smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
}
+extern void pgaio_io_set_target_smgr(struct PgAioHandle *ioh,
+ SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool skip_fsync);
+
#endif /* SMGR_H */
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 93f71690169..7fd42880535 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -18,6 +18,7 @@
#include "miscadmin.h"
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/md.h"
#include "utils/memutils.h"
@@ -38,6 +39,9 @@ typedef struct PgAioHandleCallbacksEntry
static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
#define CALLBACK_ENTRY(id, callback) [id] = {.cb = &callback, .name = #callback}
CALLBACK_ENTRY(PGAIO_HCB_INVALID, aio_invalid_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
+ CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/aio/aio_target.c b/src/backend/storage/aio/aio_target.c
index 15428968e58..a43edd89890 100644
--- a/src/backend/storage/aio/aio_target.c
+++ b/src/backend/storage/aio/aio_target.c
@@ -18,6 +18,7 @@
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/smgr.h"
/*
@@ -31,6 +32,7 @@ static const PgAioTargetInfo *pgaio_target_info[] = {
[PGAIO_TID_INVALID] = &(PgAioTargetInfo) {
.name = "invalid",
},
+ [PGAIO_TID_SMGR] = &aio_smgr_target_info,
};
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 843d1021cf9..89f2dc29555 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -94,6 +94,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/startup.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
@@ -1294,6 +1295,8 @@ LruDelete(File file)
vfdP = &VfdCache[file];
+ pgaio_closing_fd(vfdP->fd);
+
/*
* Close the file. We aren't expecting this to fail; if it does, better
* to leak the FD than to mess up our internal state.
@@ -1987,6 +1990,8 @@ FileClose(File file)
if (!FileIsNotOpen(file))
{
+ pgaio_closing_fd(vfdP->fd);
+
/* close the file */
if (close(vfdP->fd) != 0)
{
@@ -2210,6 +2215,32 @@ retry:
return returnCode;
}
+int
+FileStartReadV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
ssize_t
FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
uint32 wait_event_info)
@@ -2315,6 +2346,34 @@ retry:
return returnCode;
}
+int
+FileStartWriteV(struct PgAioHandle *ioh, File file,
+ int iovcnt, off_t offset,
+ uint32 wait_event_info)
+{
+ int returnCode;
+ Vfd *vfdP;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+ file, VfdCache[file].fileName,
+ (int64) offset,
+ iovcnt));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ vfdP = &VfdCache[file];
+
+ /* FIXME: think about / reimplement temp_file_limit */
+
+ pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+ return 0;
+}
+
int
FileSync(File file, uint32 wait_event_info)
{
@@ -2498,6 +2557,12 @@ FilePathName(File file)
int
FileGetRawDesc(File file)
{
+ int returnCode;
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
Assert(FileIsValid(file));
return VfdCache[file].fd;
}
@@ -2778,6 +2843,7 @@ FreeDesc(AllocateDesc *desc)
result = closedir(desc->desc.dir);
break;
case AllocateDescRawFD:
+ pgaio_closing_fd(desc->desc.fd);
result = close(desc->desc.fd);
break;
default:
@@ -2846,6 +2912,8 @@ CloseTransientFile(int fd)
/* Only get here if someone passes us a file not in allocatedDescs */
elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");
+ pgaio_closing_fd(fd);
+
return close(fd);
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7bf0b45e2c3..e204b7abba6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,6 +31,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/md.h"
@@ -132,6 +133,22 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static PgAioResult md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_readv_report(PgAioResult result, const PgAioTargetData *target_data, int elevel);
+static PgAioResult md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result);
+static void md_writev_report(PgAioResult result, const PgAioTargetData *target_data, int elevel);
+
+const struct PgAioHandleCallbacks aio_md_readv_cb = {
+ .complete_shared = md_readv_complete,
+ .report = md_readv_report,
+};
+
+const struct PgAioHandleCallbacks aio_md_writev_cb = {
+ .complete_shared = md_writev_complete,
+ .report = md_writev_report,
+};
+
+
static inline int
_mdfd_open_flags(void)
{
@@ -927,6 +944,53 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartreadv(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "read crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+ pgaio_io_set_target_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks,
+ false);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV);
+
+ FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_READ);
+}
+
/*
* mdwritev() -- Write the supplied blocks at the appropriate location.
*
@@ -1032,6 +1096,53 @@ mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
}
+void
+mdstartwritev(PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ off_t seekpos;
+ MdfdVec *v;
+ BlockNumber nblocks_this_segment;
+ struct iovec *iov;
+ int iovcnt;
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+ seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ nblocks_this_segment =
+ Min(nblocks,
+ RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+ if (nblocks_this_segment != nblocks)
+ elog(ERROR, "write crossing segment boundary");
+
+ iovcnt = pgaio_io_get_iovec(ioh, &iov);
+
+ Assert(nblocks <= iovcnt);
+
+ iovcnt = buffers_to_iovec(iov, unconstify(void **, buffers), nblocks_this_segment);
+
+ Assert(iovcnt <= nblocks_this_segment);
+
+ if (!(io_direct_flags & IO_DIRECT_DATA))
+ pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);
+
+ pgaio_io_set_target_smgr(ioh,
+ reln,
+ forknum,
+ blocknum,
+ nblocks,
+ skipFsync);
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_WRITEV);
+
+ FileStartWriteV(ioh, v->mdfd_vfd, iovcnt, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+}
+
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
@@ -1355,6 +1466,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+int
+mdfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ EXTENSION_FAIL);
+
+ *off = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+ Assert(*off < (off_t) BLCKSZ * RELSEG_SIZE);
+
+ return FileGetRawDesc(v->mdfd_vfd);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1405,6 +1531,35 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
+/*
+ * Like register_dirty_segment(), except for use by AIO. In the completion
+ * callback we don't have access to the MdfdVec (the completion callback might
+ * be executed in a different backend than the issuing backend), so this
+ * has to be implemented slightly differently.
+ */
+static void
+register_dirty_segment_aio(RelFileLocator locator, ForkNumber forknum, uint64 segno)
+{
+ FileTag tag;
+
+ INIT_MD_FILETAG(tag, locator, forknum, segno);
+
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
+ {
+ char path[MAXPGPATH];
+
+ ereport(DEBUG1,
+ (errmsg_internal("could not forward fsync request because request queue is full")));
+
+ /* reuse mdsyncfiletag() to avoid duplicating code */
+ if (mdsyncfiletag(&tag, path))
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ }
+}
+
/*
* register_unlink_segment() -- Schedule a file to be deleted after next checkpoint
*/
@@ -1838,3 +1993,208 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
*/
return ftag->rlocator.dbOid == candidate->rlocator.dbOid;
}
+
+/*
+ * AIO completion callback for mdstartreadv().
+ */
+static PgAioResult
+md_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_READV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_readv_report(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks read a failure */
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_READV;
+ result.error_data = 0;
+
+ md_readv_report(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial reads should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = PGAIO_HCB_MD_READV;
+ }
+
+ /* AFIXME: post-read portion of mdreadv() */
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartreadv().
+ */
+static void
+md_readv_report(PgAioResult result, const PgAioTargetData *sd, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ char *path;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not read blocks %u..%u in file \"%s\": %m",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path
+ )
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path,
+ result.result * (size_t) BLCKSZ,
+ sd->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ pfree(path);
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * AIO completion callback for mdstartwritev().
+ */
+static PgAioResult
+md_writev_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioResult result = prior_result;
+
+ if (prior_result.result < 0)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ /* For "hard" errors, track the error number in error_data */
+ result.error_data = -prior_result.result;
+ result.result = 0;
+
+ md_writev_report(result, sd, LOG);
+
+ return result;
+ }
+
+ result.result /= BLCKSZ;
+
+ if (result.result == 0)
+ {
+ /* consider 0 blocks written a failure */
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ result.error_data = 0;
+
+ md_writev_report(result, sd, LOG);
+ }
+
+ if (result.status != ARS_ERROR &&
+ result.result < sd->smgr.nblocks)
+ {
+ /* partial writes should be retried at upper level */
+ result.status = ARS_PARTIAL;
+ result.id = PGAIO_HCB_MD_WRITEV;
+ }
+
+ if (!sd->smgr.skip_fsync)
+ register_dirty_segment_aio(sd->smgr.rlocator, sd->smgr.forkNum,
+ sd->smgr.blockNum / ((BlockNumber) RELSEG_SIZE));
+
+ return result;
+}
+
+/*
+ * AIO error reporting callback for mdstartwritev().
+ */
+static void
+md_writev_report(PgAioResult result, const PgAioTargetData *sd, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ char *path;
+
+ /* AFIXME: */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (result.error_data != 0)
+ {
+ errno = result.error_data; /* for errcode_for_file_access() */
+
+ ereport(elevel,
+ errcode_for_file_access(),
+ errmsg("could not write blocks %u..%u in file \"%s\": %m",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path)
+ );
+ }
+ else
+ {
+ /*
+ * NB: This will typically only be output in debug messages, while
+ * retrying a partial IO.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not write blocks %u..%u in file \"%s\": wrote only %zu of %zu bytes",
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path,
+ result.result * (size_t) BLCKSZ,
+ sd->smgr.nblocks * (size_t) BLCKSZ
+ )
+ );
+ }
+
+ pfree(path);
+ MemoryContextSwitchTo(oldContext);
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ebe35c04de5..fb231e6ad48 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -53,6 +53,7 @@
#include "access/xlogutils.h"
#include "lib/ilist.h"
+#include "storage/aio.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/md.h"
@@ -93,10 +94,19 @@ typedef struct f_smgr
void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
void **buffers, BlockNumber nblocks);
+ void (*smgr_startreadv) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks);
void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffers, BlockNumber nblocks,
bool skipFsync);
+ void (*smgr_startwritev) (struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks,
+ bool skipFsync);
void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, BlockNumber nblocks);
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -104,6 +114,7 @@ typedef struct f_smgr
BlockNumber old_blocks, BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
+ int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -121,12 +132,15 @@ static const f_smgr smgrsw[] = {
.smgr_prefetch = mdprefetch,
.smgr_maxcombine = mdmaxcombine,
.smgr_readv = mdreadv,
+ .smgr_startreadv = mdstartreadv,
.smgr_writev = mdwritev,
+ .smgr_startwritev = mdstartwritev,
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
.smgr_registersync = mdregistersync,
+ .smgr_fd = mdfd,
}
};
@@ -145,6 +159,16 @@ static void smgrshutdown(int code, Datum arg);
static void smgrdestroy(SMgrRelation reln);
+static void smgr_aio_reopen(PgAioHandle *ioh);
+static char *smgr_aio_describe_identity(const PgAioTargetData *sd);
+
+const struct PgAioTargetInfo aio_smgr_target_info = {
+ .name = "smgr",
+ .reopen = smgr_aio_reopen,
+ .describe_identity = smgr_aio_describe_identity,
+};
+
+
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
* managers.
@@ -623,6 +647,22 @@ smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nblocks);
}
+/*
+ * smgrstartreadv() -- asynchronous version of smgrreadv()
+ *
+ * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
+ * `ioh` the parameters are the same as for smgrreadv().
+ */
+void
+smgrstartreadv(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ void **buffers, BlockNumber nblocks)
+{
+ smgrsw[reln->smgr_which].smgr_startreadv(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks);
+}
+
/*
* smgrwritev() -- Write the supplied buffers out.
*
@@ -657,6 +694,22 @@ smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
buffers, nblocks, skipFsync);
}
+/*
+ * smgrstartwritev() -- asynchronous version of smgrwritev()
+ *
+ * This starts an asynchronous writev IO using the IO handle `ioh`. Other than
+ * `ioh` the parameters are the same as for smgrwritev().
+ */
+void
+smgrstartwritev(struct PgAioHandle *ioh,
+ SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ const void **buffers, BlockNumber nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_startwritev(ioh,
+ reln, forknum, blocknum, buffers,
+ nblocks, skipFsync);
+}
+
/*
* smgrwriteback() -- Trigger kernel writeback for the supplied range of
* blocks.
@@ -819,6 +869,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+int
+smgrfd(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, uint32 *off)
+{
+ return smgrsw[reln->smgr_which].smgr_fd(reln, forknum, blocknum, off);
+}
+
/*
* AtEOXact_SMgr
*
@@ -847,3 +903,73 @@ ProcessBarrierSmgrRelease(void)
smgrreleaseall();
return true;
}
+
+void
+pgaio_io_set_target_smgr(PgAioHandle *ioh,
+ struct SMgrRelationData *smgr,
+ ForkNumber forknum,
+ BlockNumber blocknum,
+ int nblocks,
+ bool skip_fsync)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+
+ pgaio_io_set_target(ioh, PGAIO_TID_SMGR);
+
+ /* backend is implied via IO owner */
+ sd->smgr.rlocator = smgr->smgr_rlocator.locator;
+ sd->smgr.forkNum = forknum;
+ sd->smgr.blockNum = blocknum;
+ sd->smgr.nblocks = nblocks;
+ sd->smgr.is_temp = SmgrIsTemp(smgr);
+ sd->smgr.release_lock = false;
+ /* Temp relations should never be fsync'd */
+ sd->smgr.skip_fsync = skip_fsync && !SmgrIsTemp(smgr);
+ sd->smgr.mode = RBM_NORMAL;
+}
+
+static void
+smgr_aio_reopen(PgAioHandle *ioh)
+{
+ PgAioTargetData *sd = pgaio_io_get_target_data(ioh);
+ PgAioOpData *od = pgaio_io_get_op_data(ioh);
+ SMgrRelation reln;
+ ProcNumber procno;
+ uint32 off;
+
+ if (sd->smgr.is_temp)
+ procno = pgaio_io_get_owner(ioh);
+ else
+ procno = INVALID_PROC_NUMBER;
+
+ reln = smgropen(sd->smgr.rlocator, procno);
+ od->read.fd = smgrfd(reln, sd->smgr.forkNum, sd->smgr.blockNum, &off);
+ Assert(off == od->read.offset);
+}
+
+static char *
+smgr_aio_describe_identity(const PgAioTargetData *sd)
+{
+ char *path;
+ char *desc;
+
+ path = relpathbackend(sd->smgr.rlocator,
+ sd->smgr.is_temp ? MyProcNumber : INVALID_PROC_NUMBER,
+ sd->smgr.forkNum);
+
+ if (sd->smgr.nblocks == 0)
+ desc = psprintf(_("file \"%s\""), path);
+ else if (sd->smgr.nblocks == 1)
+ desc = psprintf(_("block %u in file \"%s\""),
+ sd->smgr.blockNum,
+ path);
+ else
+ desc = psprintf(_("blocks %u..%u in file \"%s\""),
+ sd->smgr.blockNum,
+ sd->smgr.blockNum + sd->smgr.nblocks - 1,
+ path);
+
+ pfree(path);
+
+ return desc;
+}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0018-aio-Add-pg_aios-view.patch (text/x-diff; charset=us-ascii)
From a92ecb8ff29feaa485c50c10914f30678d3694ad Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:40 -0500
Subject: [PATCH v2.3 18/30] aio: Add pg_aios view
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/catalog/pg_proc.dat | 10 ++
src/backend/catalog/system_views.sql | 3 +
src/backend/storage/aio/Makefile | 1 +
src/backend/storage/aio/aio_funcs.c | 240 +++++++++++++++++++++++++++
src/backend/storage/aio/meson.build | 1 +
src/test/regress/expected/rules.out | 17 ++
6 files changed, 272 insertions(+)
create mode 100644 src/backend/storage/aio/aio_funcs.c
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18560755d26..df29275d7b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12435,4 +12435,14 @@
proargtypes => 'int4',
prosrc => 'gist_stratnum_common' },
+# AIO related functions
+{ oid => '9200', descr => 'information about in-progress asynchronous IOs',
+ proname => 'pg_get_aios', prorows => '100', proretset => 't',
+ provolatile => 'v', proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int4,int4,int8,text,text,int8,int8,text,int2,int4,text,text,text,bool,bool,bool}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,io_id,io_generation,state,operation,offset,length,target,handle_data_len,raw_result,result,error_desc,target_desc,f_sync,f_localmem,f_buffered}',
+ prosrc => 'pg_get_aios' },
+
+
]
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..884c73cd2bf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1388,3 +1388,6 @@ CREATE VIEW pg_stat_subscription_stats AS
CREATE VIEW pg_wait_events AS
SELECT * FROM pg_get_wait_events();
+
+CREATE VIEW pg_aios AS
+ SELECT * FROM pg_get_aios();
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
index c06c50771e0..3f2469cc399 100644
--- a/src/backend/storage/aio/Makefile
+++ b/src/backend/storage/aio/Makefile
@@ -11,6 +11,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
aio.o \
aio_callback.o \
+ aio_funcs.o \
aio_init.o \
aio_io.o \
aio_target.o \
diff --git a/src/backend/storage/aio/aio_funcs.c b/src/backend/storage/aio/aio_funcs.c
new file mode 100644
index 00000000000..65ee3cb22a6
--- /dev/null
+++ b/src/backend/storage/aio/aio_funcs.c
@@ -0,0 +1,240 @@
+/*-------------------------------------------------------------------------
+ *
+ * aio_funcs.c
+ * AIO - SQL interface for AIO
+ *
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/aio/aio_funcs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/aio.h"
+#include "storage/aio_internal.h"
+#include "utils/builtins.h"
+#include "funcapi.h"
+#include "storage/proc.h"
+
+
+/*
+ * Byte length of an iovec.
+ */
+static size_t
+iov_byte_length(const struct iovec *iov, int cnt)
+{
+ size_t len = 0;
+
+ for (int i = 0; i < cnt; i++)
+ {
+ len += iov[i].iov_len;
+ }
+
+ return len;
+}
+
+static const char *
+pgaio_result_status_string(PgAioResultStatus rs)
+{
+ switch (rs)
+ {
+ case ARS_UNKNOWN:
+ return "UNKNOWN";
+ case ARS_OK:
+ return "OK";
+ case ARS_PARTIAL:
+ return "PARTIAL";
+ case ARS_ERROR:
+ return "ERROR";
+ }
+
+ return NULL; /* silence compiler */
+}
+
+Datum
+pg_get_aios(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ InitMaterializedSRF(fcinfo, 0);
+
+#define PG_GET_AIOS_COLS 16
+
+ for (uint64 i = 0; i < pgaio_ctl->io_handle_count; i++)
+ {
+ PgAioHandle *live_ioh = &pgaio_ctl->io_handles[i];
+ uint32 ioh_id = pgaio_io_get_id(live_ioh);
+ Datum values[PG_GET_AIOS_COLS] = {0};
+ bool nulls[PG_GET_AIOS_COLS] = {0};
+ ProcNumber owner;
+ PGPROC *owner_proc;
+ int32 owner_pid;
+ PgAioHandleState start_state;
+ uint64 start_generation;
+ PgAioHandle ioh_copy;
+ struct iovec iov_copy[PG_IOV_MAX];
+
+retry:
+
+ /*
+ * There is no lock that could prevent the state of the IO from
+ * advancing concurrently - and we don't want to introduce one, as that
+ * would add atomics to a very common path. Instead we
+ *
+ * 1) determine the state + generation of the IO
+ *
+ * 2) copy the IO to local memory
+ *
+ * 3) check if state and generation of the IO changed
+ */
+
+ /* 1) from above */
+ start_generation = live_ioh->generation;
+ pg_read_barrier();
+ start_state = live_ioh->state;
+
+ if (start_state == PGAIO_HS_IDLE)
+ continue;
+
+ /* 2) from above */
+ memcpy(&ioh_copy, live_ioh, sizeof(PgAioHandle));
+
+ /*
+ * Safe to copy even if no iovec is used - we always reserve the
+ * required space.
+ */
+ memcpy(&iov_copy, &pgaio_ctl->iovecs[ioh_copy.iovec_off],
+ PG_IOV_MAX * sizeof(struct iovec));
+
+ /*
+ * Copy information about the owner before 3) below; if the process had
+ * exited, it would have had to wait for the IO to finish first, which
+ * we would detect in 3).
+ */
+ owner = ioh_copy.owner_procno;
+ owner_proc = GetPGProcByNumber(owner);
+ owner_pid = owner_proc->pid;
+
+ /* 3) from above */
+ pg_read_barrier();
+
+ /*
+ * The IO completed and a new one was started with the same ID. Don't
+ * display it - it really started after this function was called. If we
+ * just retried endlessly, there would be a risk of a livelock when IOs
+ * complete very quickly.
+ */
+ if (live_ioh->generation != start_generation)
+ continue;
+
+ /*
+ * The IO's state changed while we were "rendering" it. Just start from
+ * scratch. There's no risk of a livelock here, as an IO has a limited
+ * set of states it can be in, and state changes go only in a single
+ * direction.
+ */
+ if (live_ioh->state != start_state)
+ goto retry;
+
+ /*
+ * Now that we have copied the IO into local memory and checked that
+ * it's still in the same state, we are not allowed to access "live"
+ * memory anymore. To make it slightly easier to catch such cases, set
+ * the "live" pointers to NULL.
+ */
+ live_ioh = NULL;
+ owner_proc = NULL;
+
+
+ /* column: owning pid */
+ if (owner_pid != 0)
+ values[0] = Int32GetDatum(owner_pid);
+ else
+ nulls[0] = true;
+
+ /* column: IO's id */
+ values[1] = UInt32GetDatum(ioh_id);
+
+ /* column: IO's generation */
+ values[2] = Int64GetDatum(start_generation);
+
+ /* column: IO's state */
+ values[3] = CStringGetTextDatum(pgaio_io_get_state_name(&ioh_copy));
+
+ /*
+	 * If the IO is in PGAIO_HS_HANDED_OUT state, none of its fields are
+ * valid yet (or are in the process of being set). Therefore we don't
+ * want to display any other columns.
+ */
+ if (start_state == PGAIO_HS_HANDED_OUT)
+ {
+ memset(nulls + 4, 1, (lengthof(nulls) - 4) * sizeof(bool));
+ goto display;
+ }
+
+ /* column: IO's operation */
+ values[4] = CStringGetTextDatum(pgaio_io_get_op_name(&ioh_copy));
+
+ /* columns: details about the IO's operation */
+ switch (ioh_copy.op)
+ {
+ case PGAIO_OP_INVALID:
+ nulls[5] = true;
+ nulls[6] = true;
+ break;
+ case PGAIO_OP_READV:
+ values[5] = Int64GetDatum(ioh_copy.op_data.read.offset);
+ values[6] =
+ Int64GetDatum(iov_byte_length(iov_copy, ioh_copy.op_data.read.iov_length));
+ break;
+ case PGAIO_OP_WRITEV:
+ values[5] = Int64GetDatum(ioh_copy.op_data.write.offset);
+ values[6] =
+ Int64GetDatum(iov_byte_length(iov_copy, ioh_copy.op_data.write.iov_length));
+ break;
+ }
+
+ /* column: IO's target */
+ values[7] = CStringGetTextDatum(pgaio_io_get_target_name(&ioh_copy));
+
+ /* column: length of IO's data array */
+ values[8] = Int16GetDatum(ioh_copy.handle_data_len);
+
+ /* column: raw result (i.e. some form of syscall return value) */
+ if (start_state == PGAIO_HS_COMPLETED_IO
+ || start_state == PGAIO_HS_COMPLETED_SHARED)
+ values[9] = Int32GetDatum(ioh_copy.result);
+ else
+ nulls[9] = true;
+
+ /*
+	 * column: result in the higher-level representation (unknown if not
+	 * yet finished)
+ */
+ values[10] =
+ CStringGetTextDatum(pgaio_result_status_string(ioh_copy.distilled_result.status));
+
+ /* column: error description */
+ /* AFIXME: implement */
+ nulls[11] = true;
+
+ /* column: target description */
+ values[12] = CStringGetTextDatum(pgaio_io_get_target_description(&ioh_copy));
+
+ /* columns: one for each flag */
+ values[13] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_SYNCHRONOUS);
+ values[14] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_REFERENCES_LOCAL);
+ values[15] = BoolGetDatum(ioh_copy.flags & PGAIO_HF_BUFFERED);
+
+display:
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+ }
+
+ return (Datum) 0;
+}
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
index 2f0f03d8071..da6df2d3654 100644
--- a/src/backend/storage/aio/meson.build
+++ b/src/backend/storage/aio/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'aio.c',
'aio_callback.c',
+ 'aio_funcs.c',
'aio_init.c',
'aio_io.c',
'aio_target.c',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..c0e18a350f5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1286,6 +1286,23 @@ drop table cchild;
SELECT viewname, definition FROM pg_views
WHERE schemaname = 'pg_catalog'
ORDER BY viewname;
+pg_aios| SELECT pid,
+ io_id,
+ io_generation,
+ state,
+ operation,
+ "offset",
+ length,
+ target,
+ handle_data_len,
+ raw_result,
+ result,
+ error_desc,
+ target_desc,
+ f_sync,
+ f_localmem,
+ f_buffered
+ FROM pg_get_aios() pg_get_aios(pid, io_id, io_generation, state, operation, "offset", length, target, handle_data_len, raw_result, result, error_desc, target_desc, f_sync, f_localmem, f_buffered);
pg_available_extension_versions| SELECT e.name,
e.version,
(x.extname IS NOT NULL) AS installed,
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0019-bufmgr-Implement-AIO-read-support.patch (text/x-diff)
From 34bdf7e671846828be4d194cee881218b78a817b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 16:08:58 -0500
Subject: [PATCH v2.3 19/30] bufmgr: Implement AIO read support
As of this commit there are no users of these AIO facilities; that'll come in
later commits.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/aio.h | 4 +
src/include/storage/buf_internals.h | 6 +
src/include/storage/bufmgr.h | 8 +
src/backend/storage/aio/aio_callback.c | 5 +
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 389 ++++++++++++++++++++++++-
src/backend/storage/buffer/localbuf.c | 65 +++++
7 files changed, 473 insertions(+), 7 deletions(-)
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index a948eaeefa7..6f36a0b9e4d 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -178,6 +178,10 @@ typedef enum PgAioHandleCallbackID
PGAIO_HCB_MD_READV,
PGAIO_HCB_MD_WRITEV,
+
+ PGAIO_HCB_SHARED_BUFFER_READV,
+
+ PGAIO_HCB_LOCAL_BUFFER_READV,
} PgAioHandleCallbackID;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..9f936cd6b84 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -17,6 +17,7 @@
#include "pgstat.h"
#include "port/atomics.h"
+#include "storage/aio_types.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -251,6 +252,8 @@ typedef struct BufferDesc
int wait_backend_pgprocno; /* backend of pin-count waiter */
int freeNext; /* link in freelist chain */
+
+ PgAioWaitRef io_wref;
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -464,4 +467,7 @@ extern void DropRelationLocalBuffers(RelFileLocator rlocator,
extern void DropRelationAllLocalBuffers(RelFileLocator rlocator);
extern void AtEOXact_LocalBuffers(bool isCommit);
+
+extern bool ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 46b4e0d90f3..5cff4e223f9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -177,6 +177,12 @@ extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;
+
+struct PgAioHandleCallbacks;
+extern const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb;
+extern const struct PgAioHandleCallbacks aio_local_buffer_readv_cb;
+
+
/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY 1000
@@ -194,6 +200,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+struct PgAioHandle;
+
extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 7fd42880535..6054f57eb23 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -18,6 +18,7 @@
#include "miscadmin.h"
#include "storage/aio.h"
#include "storage/aio_internal.h"
+#include "storage/bufmgr.h"
#include "storage/md.h"
#include "utils/memutils.h"
@@ -42,6 +43,10 @@ static const PgAioHandleCallbacksEntry aio_handle_cbs[] = {
CALLBACK_ENTRY(PGAIO_HCB_MD_READV, aio_md_readv_cb),
CALLBACK_ENTRY(PGAIO_HCB_MD_WRITEV, aio_md_writev_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_SHARED_BUFFER_READV, aio_shared_buffer_readv_cb),
+
+ CALLBACK_ENTRY(PGAIO_HCB_LOCAL_BUFFER_READV, aio_local_buffer_readv_cb),
#undef CALLBACK_ENTRY
};
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1f8e03190..ed1dc488a42 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,7 @@
*/
#include "postgres.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
@@ -125,6 +126,8 @@ BufferManagerShmemInit(void)
buf->buf_id = i;
+ pgaio_wref_clear(&buf->io_wref);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0d8849bf894..169829e8031 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -58,6 +59,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -514,7 +516,8 @@ static int SyncOneBuffer(int buf_id, bool skip_recently_used,
static void WaitIO(BufferDesc *buf);
static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
- uint32 set_flag_bits, bool forget_owner);
+ uint32 set_flag_bits, bool forget_owner,
+ bool syncio);
static void AbortBufferIO(Buffer buffer);
static void shared_buffer_write_error_callback(void *arg);
static void local_buffer_write_error_callback(void *arg);
@@ -1081,7 +1084,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
}
else if (!isLocalBuf)
@@ -1566,7 +1569,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true);
+ TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
}
/* Report I/Os as completing individually. */
@@ -2450,7 +2453,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
if (lock)
LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
- TerminateBufferIO(buf_hdr, false, BM_VALID, true);
+ TerminateBufferIO(buf_hdr, false, BM_VALID, true, true);
}
pgBufferUsage.shared_blks_written += extend_by;
@@ -3899,7 +3902,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the BM_IO_IN_PROGRESS state.
*/
- TerminateBufferIO(buf, true, 0, true);
+ TerminateBufferIO(buf, true, 0, true, true);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
buf->tag.blockNum,
@@ -5456,6 +5459,7 @@ WaitIO(BufferDesc *buf)
for (;;)
{
uint32 buf_state;
+ PgAioWaitRef iow;
/*
* It may not be necessary to acquire the spinlock to check the flag
@@ -5463,10 +5467,19 @@ WaitIO(BufferDesc *buf)
* play it safe.
*/
buf_state = LockBufHdr(buf);
+ iow = buf->io_wref;
UnlockBufHdr(buf, buf_state);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
+
+ if (pgaio_wref_valid(&iow))
+ {
+ pgaio_wref_wait(&iow);
+ ConditionVariablePrepareToSleep(cv);
+ continue;
+ }
+
ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
}
ConditionVariableCancelSleep();
@@ -5555,7 +5568,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
*/
static void
TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
- bool forget_owner)
+ bool forget_owner, bool syncio)
{
uint32 buf_state;
@@ -5567,6 +5580,13 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+ if (!syncio)
+ {
+ /* release ownership by the AIO subsystem */
+ buf_state -= BUF_REFCOUNT_ONE;
+ pgaio_wref_clear(&buf->io_wref);
+ }
+
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
@@ -5575,6 +5595,40 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
+
+ /*
+	 * If we just released a pin, we need to do BM_PIN_COUNT_WAITER handling.
+ * Most of the time the current backend will hold another pin preventing
+ * that from happening, but that's e.g. not the case when completing an IO
+ * another backend started.
+ *
+ * AFIXME: Deduplicate with UnpinBufferNoOwner() or just replace
+ * BM_PIN_COUNT_WAITER with something saner.
+ */
+ /* Support LockBufferForCleanup() */
+ if (buf_state & BM_PIN_COUNT_WAITER)
+ {
+ /*
+ * Acquire the buffer header lock, re-check that there's a waiter.
+ * Another backend could have unpinned this buffer, and already woken
+ * up the waiter. There's no danger of the buffer being replaced
+ * after we unpinned it above, as it's pinned by the waiter.
+ */
+ buf_state = LockBufHdr(buf);
+
+ if ((buf_state & BM_PIN_COUNT_WAITER) &&
+ BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ int wait_backend_pgprocno = buf->wait_backend_pgprocno;
+
+ buf_state &= ~BM_PIN_COUNT_WAITER;
+ UnlockBufHdr(buf, buf_state);
+ ProcSendSignal(wait_backend_pgprocno);
+ }
+ else
+ UnlockBufHdr(buf, buf_state);
+ }
}
/*
@@ -5626,7 +5680,7 @@ AbortBufferIO(Buffer buffer)
}
}
- TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false);
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR, false, true);
}
/*
@@ -6085,3 +6139,324 @@ EvictUnpinnedBuffer(Buffer buf)
return result;
}
+
+static bool
+ReadBufferCompleteReadShared(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *bufHdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+ blockno = bufHdr->tag.blockNum;
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_VALID));
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(!(buf_state & BM_DIRTY));
+ }
+#endif
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
+ BlockNumber forkNum = bufHdr->tag.forkNum;
+
+ /* AFIXME: relpathperm allocates memory */
+ MemoryContextSwitchTo(ErrorContext);
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathperm(rlocator, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ TerminateBufferIO(bufHdr, false,
+ failed ? BM_IO_ERROR : BM_VALID,
+ false, false);
+
+ /* Report I/Os as completing individually. */
+
+ /* FIXME: Should we do TRACE_POSTGRESQL_BUFFER_READ_DONE here? */
+ return buf_failed;
+}
+
+/*
+ * Helper to prepare IO on shared buffers for execution, shared between reads
+ * and writes.
+ */
+static void
+shared_buffer_stage_common(PgAioHandle *ioh, bool is_write)
+{
+ uint64 *io_data;
+ uint8 handle_data_len;
+ PgAioWaitRef io_ref;
+ BufferTag first PG_USED_FOR_ASSERTS_ONLY = {0};
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ pgaio_io_get_wref(ioh, &io_ref);
+
+ for (int i = 0; i < handle_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetBufferDescriptor(buf - 1);
+
+ if (i == 0)
+ first = bufHdr->tag;
+ else
+ {
+ Assert(bufHdr->tag.relNumber == first.relNumber);
+ Assert(bufHdr->tag.blockNum == first.blockNum + i);
+ }
+
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(buf_state & BM_TAG_VALID);
+ if (is_write)
+ {
+ Assert(buf_state & BM_VALID);
+ Assert(buf_state & BM_DIRTY);
+ }
+ else
+ Assert(!(buf_state & BM_VALID));
+
+ Assert(buf_state & BM_IO_IN_PROGRESS);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
+
+ buf_state += BUF_REFCOUNT_ONE;
+ bufHdr->io_wref = io_ref;
+
+ UnlockBufHdr(bufHdr, buf_state);
+
+ if (is_write)
+ {
+ LWLock *content_lock;
+
+ content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+ Assert(LWLockHeldByMe(content_lock));
+
+ /*
+ * Lock is now owned by AIO subsystem.
+ */
+ LWLockDisown(content_lock);
+ RESUME_INTERRUPTS();
+ }
+
+ /*
+ * Stop tracking this buffer via the resowner - the AIO system now
+ * keeps track.
+ */
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner, buf);
+ }
+}
+
+static void
+shared_buffer_readv_stage(PgAioHandle *ioh)
+{
+ shared_buffer_stage_common(ioh, false);
+}
+
+static PgAioResult
+shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ int mode = pgaio_io_get_target_data(ioh)->smgr.mode;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ ereport(DEBUG5,
+ errmsg("calling rbcrs for buf %d with failed %d, status: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off),
+ errhidestmt(true), errhidecontext(true));
+
+ /*
+ * XXX: It might be better to not set BM_IO_ERROR (which is what
+ * failed = true leads to) when it's just a short read...
+ */
+ buf_failed = ReadBufferCompleteReadShared(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_SHARED_BUFFER_READV;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+static void
+buffer_readv_report(PgAioResult result, const PgAioTargetData *target_data, int elevel)
+{
+ MemoryContext oldContext = CurrentMemoryContext;
+ ProcNumber errProc;
+
+ if (target_data->smgr.is_temp)
+ errProc = MyProcNumber;
+ else
+ errProc = INVALID_PROC_NUMBER;
+
+ /*
+ * AFIXME: need infrastructure to allow memory allocation for error
+ * reporting
+ */
+ oldContext = MemoryContextSwitchTo(ErrorContext);
+
+ ereport(elevel,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ target_data->smgr.blockNum + result.error_data,
+ relpathbackend(target_data->smgr.rlocator, errProc, target_data->smgr.forkNum)
+ )
+ );
+ MemoryContextSwitchTo(oldContext);
+}
+
+/*
+ * Helper to stage a read on local buffers for execution.
+ */
+static void
+local_buffer_readv_stage(PgAioHandle *ioh)
+{
+ uint64 *io_data;
+ uint8 handle_data_len;
+ PgAioWaitRef io_wref;
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ pgaio_io_get_wref(ioh, &io_wref);
+
+ for (int i = 0; i < handle_data_len; i++)
+ {
+ Buffer buf = (Buffer) io_data[i];
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ bufHdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ bufHdr->io_wref = io_wref;
+ LocalRefCount[-buf - 1] += 1;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+}
+
+static PgAioResult
+local_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result)
+{
+ PgAioResult result = prior_result;
+ PgAioTargetData *td = pgaio_io_get_target_data(ioh);
+ int mode = td->smgr.mode;
+ uint64 *io_data;
+ uint8 handle_data_len;
+
+ Assert(td->smgr.is_temp);
+ Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
+
+ ereport(DEBUG5,
+ errmsg("%s: %d %d", __func__, prior_result.status, prior_result.result),
+ errhidestmt(true), errhidecontext(true));
+
+ io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
+
+ for (int io_data_off = 0; io_data_off < handle_data_len; io_data_off++)
+ {
+ Buffer buf = io_data[io_data_off];
+ bool buf_failed;
+ bool failed;
+
+ failed =
+ prior_result.status == ARS_ERROR
+ || prior_result.result <= io_data_off;
+
+ ereport(DEBUG5,
+ errmsg("calling rbcrl for buf %d with failed %d, status: %d, result: %d, data_off: %d",
+ buf, failed, prior_result.status, prior_result.result, io_data_off),
+ errhidestmt(true), errhidecontext(true));
+
+ buf_failed = ReadBufferCompleteReadLocal(buf,
+ mode,
+ failed);
+
+ if (result.status != ARS_ERROR && buf_failed)
+ {
+ result.status = ARS_ERROR;
+ result.id = PGAIO_HCB_LOCAL_BUFFER_READV;
+ result.error_data = io_data_off + 1;
+ }
+ }
+
+ return result;
+}
+
+
+const struct PgAioHandleCallbacks aio_shared_buffer_readv_cb = {
+ .stage = shared_buffer_readv_stage,
+ .complete_shared = shared_buffer_readv_complete,
+ .report = buffer_readv_report,
+};
+const struct PgAioHandleCallbacks aio_local_buffer_readv_cb = {
+ .stage = local_buffer_readv_stage,
+
+ /*
+ * Note that this, in contrast to the shared_buffers case, uses
+ * complete_local, as only the issuing backend has access to the required
+	 * data structures. This matters because the IO completion may be
+	 * consumed incidentally by another backend.
+ */
+ .complete_local = local_buffer_readv_complete,
+ .report = buffer_readv_report,
+};
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8f81428970b..b3805c1ff94 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -18,6 +18,7 @@
#include "access/parallel.h"
#include "executor/instrument.h"
#include "pgstat.h"
+#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -621,6 +622,8 @@ InitLocalBuffers(void)
*/
buf->buf_id = -i - 2;
+ pgaio_wref_clear(&buf->io_wref);
+
/*
* Intentionally do not initialize the buffer's atomic variable
* (besides zeroing the underlying memory above). That way we get
@@ -837,3 +840,65 @@ AtProcExit_LocalBuffers(void)
*/
CheckForLocalBufferLeaks();
}
+
+bool
+ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
+{
+ BufferDesc *buf_hdr = NULL;
+ BlockNumber blockno;
+ bool buf_failed = false;
+ char *bufdata = BufferGetBlock(buffer);
+
+ Assert(BufferIsValid(buffer));
+
+ buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
+ blockno = buf_hdr->tag.blockNum;
+
+ /* check for garbage data */
+ if (!failed &&
+ !PageIsVerifiedExtended((Page) bufdata, blockno,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ RelFileLocator rlocator = BufTagGetRelFileLocator(&buf_hdr->tag);
+ BlockNumber forkNum = buf_hdr->tag.forkNum;
+
+ MemoryContextSwitchTo(ErrorContext);
+
+ if (mode == READ_BUFFERS_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockno,
+ relpathbackend(rlocator, MyProcNumber, forkNum))));
+ memset(bufdata, 0, BLCKSZ);
+ }
+ else
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockno,
+ relpathbackend(rlocator, MyProcNumber, forkNum))));
+ failed = true;
+ buf_failed = true;
+ }
+ }
+
+ /* Terminate I/O and set BM_VALID. */
+ pgaio_wref_clear(&buf_hdr->io_wref);
+
+ {
+ uint32 buf_state;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ /* release pin held by IO subsystem */
+ LocalRefCount[-buffer - 1] -= 1;
+
+ return buf_failed;
+}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0020-WIP-localbuf-Track-pincount-in-BufferDesc-as-we.patch (text/x-diff)
From 9c9745754dc88502e050b5822d90d20b517a052b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:44 -0500
Subject: [PATCH v2.3 20/30] WIP: localbuf: Track pincount in BufferDesc as
well
For AIO on temp tables the AIO subsystem needs to be able to ensure a pin on a
buffer while AIO is going on, even if the IO issuing query errors out. To do
so, track the refcount in BufferDesc.state, not just LocalRefCount.
Note that we still don't need locking: AIO completion callbacks for local
buffers are executed in the issuing session (nobody else has access to the
BufferDesc).
---
src/backend/storage/buffer/bufmgr.c | 40 ++++++++--
src/backend/storage/buffer/localbuf.c | 108 ++++++++++++++++----------
2 files changed, 101 insertions(+), 47 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 169829e8031..fe871691350 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5356,8 +5356,20 @@ ConditionalLockBufferForCleanup(Buffer buffer)
Assert(refcount > 0);
if (refcount != 1)
return false;
- /* Nobody else to wait for */
- return true;
+
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * Check that the AIO subsystem doesn't have a pin. Likely not
+ * possible today, but better safe than sorry.
+ */
+ refcount = BUF_STATE_GET_REFCOUNT(buf_state);
+ Assert(refcount > 0);
+ if (refcount == 1)
+ return true;
+
+ return false;
}
/* There should be exactly one local pin */
@@ -5409,8 +5421,18 @@ IsBufferCleanupOK(Buffer buffer)
/* There should be exactly one pin */
if (LocalRefCount[-buffer - 1] != 1)
return false;
- /* Nobody else to wait for */
- return true;
+
+ bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * Check that the AIO subsystem doesn't have a pin. Likely not
+ * possible today, but better safe than sorry.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ return true;
+
+ return false;
}
/* There should be exactly one local pin */
@@ -6388,9 +6410,15 @@ local_buffer_readv_stage(PgAioHandle *ioh)
buf_state = pg_atomic_read_u32(&bufHdr->state);
bufHdr->io_wref = io_wref;
- LocalRefCount[-buf - 1] += 1;
- UnlockBufHdr(bufHdr, buf_state);
+ /*
+ * Track pin by AIO subsystem in BufferDesc, not in LocalRefCount as
+ * one might initially think. This is necessary to handle this backend
+ * erroring out while AIO is still in progress.
+ */
+ buf_state += BUF_REFCOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b3805c1ff94..72c93ae15a2 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -208,10 +208,19 @@ GetLocalVictimBuffer(void)
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
trycounter = NLocBuffer;
}
+ else if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+ {
+ /*
+ * This can be reached if the backend initiated AIO for this
+ * buffer and then errored out.
+ */
+ }
else
{
/* Found a usable buffer */
PinLocalBuffer(bufHdr, false);
+ /* the buf_state may be modified inside PinLocalBuffer */
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
break;
}
}
@@ -476,6 +485,44 @@ MarkLocalBufferDirty(Buffer buffer)
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
+static void
+InvalidateLocalBuffer(BufferDesc *bufHdr)
+{
+ Buffer buffer = BufferDescriptorGetBuffer(bufHdr);
+ int bufid = -buffer - 1;
+ uint32 buf_state;
+ LocalBufferLookupEnt *hresult;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+ /*
+ * We need to test not just LocalRefCount[bufid] but also the BufferDesc
+ * itself, as the latter is used to represent a pin by the AIO subsystem.
+ * This can happen if AIO is initiated and then the query errors out.
+ */
+ if (LocalRefCount[bufid] != 0 ||
+ BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+ elog(ERROR, "block %u of %s is still referenced (local %u)",
+ bufHdr->tag.blockNum,
+ relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
+ MyProcNumber,
+ BufTagGetForkNum(&bufHdr->tag)),
+ LocalRefCount[bufid]);
+
+ /* Remove entry from hashtable */
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
+ if (!hresult) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ /* Mark buffer invalid */
+ ClearBufferTag(&bufHdr->tag);
+
+ buf_state &= ~BUF_FLAG_MASK;
+ buf_state &= ~BUF_USAGECOUNT_MASK;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+}
+
/*
* DropRelationLocalBuffers
* This function removes from the buffer pool all the pages of the
@@ -496,7 +543,6 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum,
for (i = 0; i < NLocBuffer; i++)
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
- LocalBufferLookupEnt *hresult;
uint32 buf_state;
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -506,24 +552,7 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum,
BufTagGetForkNum(&bufHdr->tag) == forkNum &&
bufHdr->tag.blockNum >= firstDelBlock)
{
- if (LocalRefCount[i] != 0)
- elog(ERROR, "block %u of %s is still referenced (local %u)",
- bufHdr->tag.blockNum,
- relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
- MyProcNumber,
- BufTagGetForkNum(&bufHdr->tag)),
- LocalRefCount[i]);
-
- /* Remove entry from hashtable */
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
- if (!hresult) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- /* Mark buffer invalid */
- ClearBufferTag(&bufHdr->tag);
- buf_state &= ~BUF_FLAG_MASK;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ InvalidateLocalBuffer(bufHdr);
}
}
}
@@ -543,7 +572,6 @@ DropRelationAllLocalBuffers(RelFileLocator rlocator)
for (i = 0; i < NLocBuffer; i++)
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
- LocalBufferLookupEnt *hresult;
uint32 buf_state;
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -551,23 +579,7 @@ DropRelationAllLocalBuffers(RelFileLocator rlocator)
if ((buf_state & BM_TAG_VALID) &&
BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
{
- if (LocalRefCount[i] != 0)
- elog(ERROR, "block %u of %s is still referenced (local %u)",
- bufHdr->tag.blockNum,
- relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
- MyProcNumber,
- BufTagGetForkNum(&bufHdr->tag)),
- LocalRefCount[i]);
- /* Remove entry from hashtable */
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
- if (!hresult) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- /* Mark buffer invalid */
- ClearBufferTag(&bufHdr->tag);
- buf_state &= ~BUF_FLAG_MASK;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ InvalidateLocalBuffer(bufHdr);
}
}
}
@@ -667,12 +679,13 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
NLocalPinnedBuffers++;
+ buf_state += BUF_REFCOUNT_ONE;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
LocalRefCount[bufid]++;
ResourceOwnerRememberBuffer(CurrentResourceOwner,
@@ -698,7 +711,17 @@ UnpinLocalBufferNoOwner(Buffer buffer)
Assert(NLocalPinnedBuffers > 0);
if (--LocalRefCount[buffid] == 0)
+ {
+ BufferDesc *buf_hdr = GetLocalBufferDescriptor(buffid);
+ uint32 buf_state;
+
NLocalPinnedBuffers--;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ buf_state -= BUF_REFCOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
}
/*
@@ -894,11 +917,14 @@ ReadBufferCompleteReadLocal(Buffer buffer, int mode, bool failed)
buf_state = pg_atomic_read_u32(&buf_hdr->state);
buf_state |= BM_VALID;
+
+ /*
+ * Release pin held by IO subsystem, see also
+	 * local_buffer_readv_stage().
+ */
+ buf_state -= BUF_REFCOUNT_ONE;
pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
}
- /* release pin held by IO subsystem */
- LocalRefCount[-buffer - 1] -= 1;
-
return buf_failed;
}
--
2.48.1.76.g4e746b1a31.dirty
Attachment: v2.3-0021-bufmgr-Use-aio-for-StartReadBuffers.patch (text/x-diff)
From b153f4c8c7cf10171dd7390920ef38e079be1c87 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Jan 2025 13:44:45 -0500
Subject: [PATCH v2.3 21/30] bufmgr: Use aio for StartReadBuffers()
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/bufmgr.h | 25 +-
src/backend/storage/buffer/bufmgr.c | 377 ++++++++++++++++++++--------
2 files changed, 298 insertions(+), 104 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5cff4e223f9..46ee957e99c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -15,6 +15,7 @@
#define BUFMGR_H
#include "port/pg_iovec.h"
+#include "storage/aio_types.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -107,10 +108,18 @@ typedef struct BufferManagerRelation
#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+
+
/* Zero out page if reading fails. */
#define READ_BUFFERS_ZERO_ON_ERROR (1 << 0)
/* Call smgrprefetch() if I/O necessary. */
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
+/* IO will immediately be waited for */
+#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+
struct ReadBuffersOperation
{
@@ -131,6 +140,20 @@ struct ReadBuffersOperation
int flags;
int16 nblocks;
int16 io_buffers_len;
+
+ /*
+ * In some rare-ish cases one operation causes multiple reads (e.g. if a
+ * buffer was concurrently read by another backend). It'd be much better
+ * if we ensured that each ReadBuffersOperation covered only one IO - but
+ * that's not entirely trivial, due to having pinned victim buffers before
+ * starting IOs.
+ *
+ * TODO: Change the API of StartReadBuffers() to ensure we only ever need
+ * one IO.
+ */
+ int16 nios;
+ PgAioWaitRef wrefs[MAX_IO_COMBINE_LIMIT];
+ PgAioReturn returns[MAX_IO_COMBINE_LIMIT];
};
typedef struct ReadBuffersOperation ReadBuffersOperation;
@@ -161,8 +184,6 @@ extern PGDLLIMPORT bool track_io_timing;
extern PGDLLIMPORT int effective_io_concurrency;
extern PGDLLIMPORT int maintenance_io_concurrency;
-#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
-#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
extern PGDLLIMPORT int io_combine_limit;
extern PGDLLIMPORT int checkpoint_flush_after;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe871691350..70f1da84083 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1235,10 +1235,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+ flags = READ_BUFFERS_SYNCHRONOUSLY;
if (mode == RBM_ZERO_ON_ERROR)
- flags = READ_BUFFERS_ZERO_ON_ERROR;
- else
- flags = 0;
+ flags |= READ_BUFFERS_ZERO_ON_ERROR;
operation.smgr = smgr;
operation.rel = rel;
operation.persistence = persistence;
@@ -1253,6 +1252,9 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
return buffer;
}
+static bool AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks);
+
static pg_attribute_always_inline bool
StartReadBuffersImpl(ReadBuffersOperation *operation,
Buffer *buffers,
@@ -1288,6 +1290,11 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
* so we stop here.
*/
actual_nblocks = i + 1;
+
+ ereport(DEBUG3,
+ errmsg("found buf at idx %i: %s",
+ i, DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
break;
}
else
@@ -1324,28 +1331,51 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
operation->flags = flags;
operation->nblocks = actual_nblocks;
operation->io_buffers_len = io_buffers_len;
+ operation->nios = 0;
- if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ /*
+ * When using AIO, start the IO in the background. If not, issue prefetch
+ * requests if desired by the caller.
+ *
+ * The reason we have a dedicated path for IOMETHOD_SYNC here is to derisk
+ * the introduction of AIO somewhat. It's a large architectural change,
+ * with lots of chances for unanticipated performance effects. Use of
+ * IOMETHOD_SYNC already leads to not actually performing IO
+ * asynchronously, but without the check here we'd execute IO earlier than
+ * we used to.
+ */
+ if (io_method != IOMETHOD_SYNC)
{
- /*
- * In theory we should only do this if PinBufferForBlock() had to
- * allocate new buffers above. That way, if two calls to
- * StartReadBuffers() were made for the same blocks before
- * WaitReadBuffers(), only the first would issue the advice. That'd be
- * a better simulation of true asynchronous I/O, which would only
- * start the I/O once, but isn't done here for simplicity. Note also
- * that the following call might actually issue two advice calls if we
- * cross a segment boundary; in a true asynchronous version we might
- * choose to process only one real I/O at a time in that case.
- */
- smgrprefetch(operation->smgr,
- operation->forknum,
- blockNum,
- operation->io_buffers_len);
+ /* initiate the IO asynchronously */
+ return AsyncReadBuffers(operation, io_buffers_len);
}
+ else
+ {
+ operation->flags |= READ_BUFFERS_SYNCHRONOUSLY;
+
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
+ {
+ /*
+ * In theory we should only do this if PinBufferForBlock() had to
+ * allocate new buffers above. That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity. Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(operation->smgr,
+ operation->forknum,
+ blockNum,
+ operation->io_buffers_len);
+ }
- /* Indicate that WaitReadBuffers() should be called. */
- return true;
+ /* Indicate that WaitReadBuffers() should be called. */
+ return true;
+ }
}
/*
@@ -1397,12 +1427,31 @@ StartReadBuffer(ReadBuffersOperation *operation,
}
static inline bool
-WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+ReadBuffersCanStartIO(Buffer buffer, bool nowait)
{
if (BufferIsLocal(buffer))
{
BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+ /*
+ * The buffer could have IO in progress by another scan. Right now
+ * localbuf.c doesn't use IO_IN_PROGRESS, which is why we need this
+ * hack.
+ *
+ * TODO: localbuf.c should use IO_IN_PROGRESS / have an equivalent of
+ * StartBufferIO().
+ */
+ if (pgaio_wref_valid(&bufHdr->io_wref))
+ {
+ PgAioWaitRef iow = bufHdr->io_wref;
+
+ ereport(DEBUG3,
+ errmsg("waiting for temp buffer IO in CSIO"),
+ errhidestmt(true), errhidecontext(true));
+ pgaio_wref_wait(&iow);
+ return false;
+ }
+
return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
}
else
@@ -1412,13 +1461,38 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
void
WaitReadBuffers(ReadBuffersOperation *operation)
{
- Buffer *buffers;
+ IOContext io_context;
+ IOObject io_object;
int nblocks;
- BlockNumber blocknum;
- ForkNumber forknum;
- IOContext io_context;
- IOObject io_object;
- char persistence;
+ bool have_retryable_failure;
+
+ /*
+ * If we get here without any IO operations having been issued, the
+ * io_method == IOMETHOD_SYNC path must have been used. In that case, we
+ * start - as we used to before - the IO now, just before waiting.
+ */
+ if (operation->nios == 0)
+ {
+ Assert(io_method == IOMETHOD_SYNC);
+ if (!AsyncReadBuffers(operation, operation->io_buffers_len))
+ {
+ /* all blocks were already read in concurrently */
+ return;
+ }
+ }
+
+ if (operation->persistence == RELPERSISTENCE_TEMP)
+ {
+ io_context = IOCONTEXT_NORMAL;
+ io_object = IOOBJECT_TEMP_RELATION;
+ }
+ else
+ {
+ io_context = IOContextForStrategy(operation->strategy);
+ io_object = IOOBJECT_RELATION;
+ }
+
+restart:
/*
* Currently operations are only allowed to include a read of some range,
@@ -1433,15 +1507,101 @@ WaitReadBuffers(ReadBuffersOperation *operation)
if (nblocks == 0)
return; /* nothing to do */
- buffers = &operation->buffers[0];
- blocknum = operation->blocknum;
- forknum = operation->forknum;
- persistence = operation->persistence;
+ Assert(operation->nios > 0);
+ /*
+ * For IO timing we just count the time spent waiting for the IO.
+ *
+ * XXX: We probably should track the IO operation, rather than its time,
+ * separately, when initiating the IO. But right now that's not quite
+ * allowed by the interface.
+ */
+ have_retryable_failure = false;
+ for (int i = 0; i < operation->nios; i++)
+ {
+ PgAioReturn *aio_ret = &operation->returns[i];
+
+ /*
+ * Tracking a wait even if we don't actually need to wait a) is not
+ * cheap b) reports some time as waiting, even if we never waited.
+ */
+ if (aio_ret->result.status == ARS_UNKNOWN &&
+ !pgaio_wref_check_done(&operation->wrefs[i]))
+ {
+ instr_time io_start = pgstat_prepare_io_time(track_io_timing);
+
+ pgaio_wref_wait(&operation->wrefs[i]);
+
+ /*
+ * The IO operation itself was already counted earlier, in
+ * AsyncReadBuffers().
+ */
+ pgstat_count_io_op_time(io_object, io_context, IOOP_READ,
+ io_start, 0, 0);
+ }
+ else
+ {
+ Assert(pgaio_wref_check_done(&operation->wrefs[i]));
+ }
+
+ if (aio_ret->result.status == ARS_PARTIAL)
+ {
+ /*
+ * We'll retry below, so we just emit a debug message to the
+ * server log (or not even that in prod scenarios).
+ */
+ pgaio_result_report(aio_ret->result, &aio_ret->target_data, DEBUG1);
+ have_retryable_failure = true;
+ }
+ else if (aio_ret->result.status != ARS_OK)
+ pgaio_result_report(aio_ret->result, &aio_ret->target_data, ERROR);
+ }
+
+ /*
+ * If any of the associated IOs failed, try again to issue IOs. Buffers
+ * for which IO has completed successfully will be discovered as such and
+ * not retried.
+ */
+ if (have_retryable_failure)
+ {
+ nblocks = operation->io_buffers_len;
+
+ elog(DEBUG3, "retrying IO after partial failure");
+ CHECK_FOR_INTERRUPTS();
+ AsyncReadBuffers(operation, nblocks);
+ goto restart;
+ }
+
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageMiss * nblocks;
+
+ /* FIXME: READ_DONE tracepoint */
+}
+
+static bool
+AsyncReadBuffers(ReadBuffersOperation *operation,
+ int nblocks)
+{
+ int io_buffers_len = 0;
+ Buffer *buffers = &operation->buffers[0];
+ int flags = operation->flags;
+ BlockNumber blocknum = operation->blocknum;
+ ForkNumber forknum = operation->forknum;
+ IOContext io_context;
+ IOObject io_object;
+ char persistence;
+ bool did_start_io_overall = false;
+ PgAioHandle *ioh = NULL;
+ uint32 ioh_flags = 0;
+
+ persistence = operation->rel
+ ? operation->rel->rd_rel->relpersistence
+ : RELPERSISTENCE_PERMANENT;
if (persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
+ ioh_flags |= PGAIO_HF_REFERENCES_LOCAL;
}
else
{
@@ -1449,6 +1609,16 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_object = IOOBJECT_RELATION;
}
+ /*
+ * When this IO is executed synchronously, either because the caller will
+ * immediately block waiting for the IO or because IOMETHOD_SYNC is used,
+ * the AIO subsystem needs to know.
+ */
+ if (flags & READ_BUFFERS_SYNCHRONOUSLY)
+ ioh_flags |= PGAIO_HF_SYNCHRONOUS;
+
+ operation->nios = 0;
+
/*
* We count all these blocks as read by this backend. This is traditional
* behavior, but might turn out to be not true if we find that someone
@@ -1464,19 +1634,39 @@ WaitReadBuffers(ReadBuffersOperation *operation)
for (int i = 0; i < nblocks; ++i)
{
- int io_buffers_len;
- Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
void *io_pages[MAX_IO_COMBINE_LIMIT];
- instr_time io_start;
+ Buffer io_buffers[MAX_IO_COMBINE_LIMIT];
BlockNumber io_first_block;
+ bool did_start_io_this = false;
/*
- * Skip this block if someone else has already completed it. If an
- * I/O is already in progress in another backend, this will wait for
- * the outcome: either done, or something went wrong and we will
- * retry.
+ * Get IO before ReadBuffersCanStartIO, as pgaio_io_acquire() might
+ * block, which we don't want after setting IO_IN_PROGRESS.
+ *
+ * XXX: Should we attribute the time spent in here to the IO? If there
+ * already are a lot of IO operations in progress, getting an IO
+ * handle will block waiting for some other IO operation to finish.
+ *
+ * In most cases it'll be free to get the IO, so a timer would be
+ * overhead. Perhaps we should use pgaio_io_acquire_nb() and only
+ * account IO time when pgaio_io_acquire_nb() returned false?
*/
- if (!WaitReadBuffersCanStartIO(buffers[i], false))
+ if (likely(!ioh))
+ ioh = pgaio_io_acquire(CurrentResourceOwner,
+ &operation->returns[operation->nios]);
+
+ /*
+ * Skip this block if someone else has already completed it.
+ *
+ * If an I/O is already in progress in another backend, this will wait
+ * for the outcome: either done, or something went wrong and we will
+ * retry. But don't wait if we have staged, but haven't issued,
+ * another IO.
+ *
+ * XXX: If we can't start IO due to unsubmitted IO, it might be worth
+ * submitting and then trying to start IO again.
+ */
+ if (!ReadBuffersCanStartIO(buffers[i], did_start_io_overall))
{
/*
* Report this as a 'hit' for this backend, even though it must
@@ -1488,6 +1678,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
operation->smgr->smgr_rlocator.locator.relNumber,
operation->smgr->smgr_rlocator.backend,
true);
+
+ ereport(DEBUG3,
+ errmsg("can't start io for first buffer %u: %s",
+ buffers[i], DebugPrintBufferRefcount(buffers[i])),
+ errhidestmt(true), errhidecontext(true));
continue;
}
@@ -1497,6 +1692,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
io_first_block = blocknum + i;
io_buffers_len = 1;
+ ereport(DEBUG5,
+ errmsg("first prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(io_buffers[0]), i),
+ errhidestmt(true), errhidecontext(true));
+
/*
* How many neighboring-on-disk blocks can we scatter-read into other
* buffers at the same time? In this case we don't wait if we see an
@@ -1505,85 +1705,58 @@ WaitReadBuffers(ReadBuffersOperation *operation)
* We'll come back to this block again, above.
*/
while ((i + 1) < nblocks &&
- WaitReadBuffersCanStartIO(buffers[i + 1], true))
+ ReadBuffersCanStartIO(buffers[i + 1], true))
{
/* Must be consecutive block numbers. */
Assert(BufferGetBlockNumber(buffers[i + 1]) ==
BufferGetBlockNumber(buffers[i]) + 1);
+ ereport(DEBUG5,
+ errmsg("seq prepped for io: %s, offset %d",
+ DebugPrintBufferRefcount(buffers[i + 1]),
+ i + 1),
+ errhidestmt(true), errhidecontext(true));
+
io_buffers[io_buffers_len] = buffers[++i];
io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
}
- io_start = pgstat_prepare_io_time(track_io_timing);
- smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
- pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
- 1, io_buffers_len * BLCKSZ);
+ pgaio_io_get_wref(ioh, &operation->wrefs[operation->nios]);
- /* Verify each block we read, and terminate the I/O. */
- for (int j = 0; j < io_buffers_len; ++j)
- {
- BufferDesc *bufHdr;
- Block bufBlock;
+ pgaio_io_set_handle_data_32(ioh, (uint32 *) io_buffers, io_buffers_len);
- if (persistence == RELPERSISTENCE_TEMP)
- {
- bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
- bufBlock = LocalBufHdrGetBlock(bufHdr);
- }
- else
- {
- bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
- bufBlock = BufHdrGetBlock(bufHdr);
- }
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- memset(bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- io_first_block + j,
- relpath(operation->smgr->smgr_rlocator, forknum))));
- }
+ if (persistence == RELPERSISTENCE_TEMP)
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_LOCAL_BUFFER_READV);
+ else
+ pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV);
- /* Terminate I/O and set BM_VALID. */
- if (persistence == RELPERSISTENCE_TEMP)
- {
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
+ pgaio_io_set_flag(ioh, ioh_flags);
- buf_state |= BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- else
- {
- /* Set BM_VALID, terminate IO, and wake up any waiters */
- TerminateBufferIO(bufHdr, false, BM_VALID, true, true);
- }
+ did_start_io_overall = did_start_io_this = true;
+ smgrstartreadv(ioh, operation->smgr, forknum, io_first_block,
+ io_pages, io_buffers_len);
+ ioh = NULL;
+ operation->nios++;
- /* Report I/Os as completing individually. */
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
- operation->smgr->smgr_rlocator.locator.spcOid,
- operation->smgr->smgr_rlocator.locator.dbOid,
- operation->smgr->smgr_rlocator.locator.relNumber,
- operation->smgr->smgr_rlocator.backend,
- false);
- }
+ /* not obvious what we'd use for time */
+ pgstat_count_io_op(io_object, io_context, IOOP_READ,
+ 1, io_buffers_len * BLCKSZ);
+ }
+
+ if (ioh)
+ {
+ pgaio_io_release(ioh);
+ ioh = NULL;
+ }
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+ if (did_start_io_overall)
+ {
+ pgaio_submit_staged();
+ return true;
}
+ else
+ return false;
}
/*
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0022-aio-Very-WIP-read_stream.c-adjustments-for-real.patch
From b0bb4b478b27c2a38bf819ee927be9167e551d28 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 31 Aug 2024 21:39:30 -0400
Subject: [PATCH v2.3 22/30] aio: Very-WIP: read_stream.c adjustments for real
AIO
Things that need to be fixed / are fixed in this:
- max pinned buffers should be limited by io_combine_limit, not * 4
- overflow distance
- pins need to be limited in more places
---
src/include/storage/bufmgr.h | 2 ++
src/backend/storage/aio/read_stream.c | 31 +++++++++++++++++++++------
src/backend/storage/buffer/bufmgr.c | 3 ++-
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 46ee957e99c..f205643c4ef 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -119,6 +119,8 @@ typedef struct BufferManagerRelation
#define READ_BUFFERS_ISSUE_ADVICE (1 << 1)
/* IO will immediately be waited for */
#define READ_BUFFERS_SYNCHRONOUSLY (1 << 2)
+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)
struct ReadBuffersOperation
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..c2211cab02a 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -90,6 +90,7 @@
#include "postgres.h"
#include "miscadmin.h"
+#include "storage/aio.h"
#include "storage/fd.h"
#include "storage/smgr.h"
#include "storage/read_stream.h"
@@ -240,14 +241,18 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
/*
* If advice hasn't been suppressed, this system supports it, and this
* isn't a strictly sequential pattern, then we'll issue advice.
+ *
+ * XXX: Used to also check stream->pending_read_blocknum !=
+ * stream->seq_blocknum
*/
if (!suppress_advice &&
- stream->advice_enabled &&
- stream->pending_read_blocknum != stream->seq_blocknum)
+ stream->advice_enabled)
flags = READ_BUFFERS_ISSUE_ADVICE;
else
flags = 0;
+ flags |= READ_BUFFERS_MORE_MORE_MORE;
+
/* We say how many blocks we want to read, but may be smaller on return. */
buffer_index = stream->next_buffer_index;
io_index = stream->next_io_index;
@@ -306,6 +311,14 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
static void
read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
+ if (stream->distance > (io_combine_limit * 8))
+ {
+ if (stream->pinned_buffers + stream->pending_read_nblocks > ((stream->distance * 3) / 4))
+ {
+ return;
+ }
+ }
+
while (stream->ios_in_progress < stream->max_ios &&
stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
@@ -355,6 +368,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
{
/* And we've hit the limit. Rewind, and stop here. */
read_stream_unget_block(stream, blocknum);
+ pgaio_submit_staged();
return;
}
}
@@ -379,6 +393,8 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
+
+ pgaio_submit_staged();
}
/*
@@ -442,7 +458,7 @@ read_stream_begin_impl(int flags,
* overflow (even though that's not possible with the current GUC range
* limits), allowing also for the spare entry and the overflow space.
*/
- max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+ max_pinned_buffers = Max(max_ios * io_combine_limit, io_combine_limit);
max_pinned_buffers = Min(max_pinned_buffers,
PG_INT16_MAX - io_combine_limit - 1);
@@ -493,10 +509,11 @@ read_stream_begin_impl(int flags,
* direct I/O isn't enabled, the caller hasn't promised sequential access
* (overriding our detection heuristics), and max_ios hasn't been set to
* zero.
+ *
+ * FIXME: Used to also check (io_direct_flags & IO_DIRECT_DATA) == 0 &&
+ * (flags & READ_STREAM_SEQUENTIAL) == 0
*/
- if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
- (flags & READ_STREAM_SEQUENTIAL) == 0 &&
- max_ios > 0)
+ if (max_ios > 0)
stream->advice_enabled = true;
#endif
@@ -727,7 +744,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
if (++stream->oldest_io_index == stream->max_ios)
stream->oldest_io_index = 0;
- if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+ if (stream->ios[io_index].op.flags & (READ_BUFFERS_ISSUE_ADVICE | READ_BUFFERS_MORE_MORE_MORE))
{
/* Distance ramps up fast (behavior C). */
distance = stream->distance * 2;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 70f1da84083..118a6e1ca31 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1752,7 +1752,8 @@ AsyncReadBuffers(ReadBuffersOperation *operation,
if (did_start_io_overall)
{
- pgaio_submit_staged();
+ if (!(flags & READ_BUFFERS_MORE_MORE_MORE))
+ pgaio_submit_staged();
return true;
}
else
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0028-Temporary-Increase-BAS_BULKREAD-size.patch
From 2dea8961fd6383afe1e457926131c2213db211f0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Sep 2024 00:42:27 -0400
Subject: [PATCH v2.3 28/30] Temporary: Increase BAS_BULKREAD size
Without this we can only execute very little AIO for sequential scans, as
there are just not enough buffers in the ring. This isn't the right fix, as
just increasing the ring size can have negative performance implications in
workloads where the kernel has all the data cached.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/freelist.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 1f757d96f07..ac19fb87433 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,7 +555,12 @@ GetAccessStrategy(BufferAccessStrategyType btype)
return NULL;
case BAS_BULKREAD:
- ring_size_kb = 256;
+
+ /*
+ * FIXME: Temporary increase to allow large enough streaming reads
+ * to actually benefit from AIO. This needs a better solution.
+ */
+ ring_size_kb = 2 * 1024;
break;
case BAS_BULKWRITE:
ring_size_kb = 16 * 1024;
--
2.48.1.76.g4e746b1a31.dirty
v2.3-0029-WIP-Use-MAP_POPULATE.patch
From ca1654b4d99e3565b2e14525b3409bc8c164849e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 31 Dec 2024 13:25:56 -0500
Subject: [PATCH v2.3 29/30] WIP: Use MAP_POPULATE
For benchmarking it's quite annoying that the first time memory is touched
it has completely different perf characteristics than subsequent accesses.
Using MAP_POPULATE reduces that substantially.
---
src/backend/port/sysv_shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..a700b02d5a1 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -620,7 +620,7 @@ CreateAnonymousSegment(Size *size)
allocsize += hugepagesize - (allocsize % hugepagesize);
ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
- PG_MMAP_FLAGS | mmap_flags, -1, 0);
+ PG_MMAP_FLAGS | MAP_POPULATE | mmap_flags, -1, 0);
mmap_errno = errno;
if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
--
2.48.1.76.g4e746b1a31.dirty
On Thu, Jan 23, 2025 at 5:29 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Attached is v2.3.
There are a lot of changes - primarily renaming things based on on-list and
off-list feedback. But also some other things
[..snip]
Hi Andres, OK, so I've hastily launched an AIO v2.3 (full, 29 patches)
patchset probe run before going on a short vacation, and the results
are attached*. TL;DR: in terms of SELECTs, master vs. aioworkers looks
very solid! I was a little afraid that the additional IPC to separate
processes would put workers at a disadvantage, but that's amazingly
not true. The intention of this effort was just to see whether
committing AIO with its defaults as it stands is good enough not to
cause basic regressions for users, and to me it looks like it is
nearly finished :)). To save time I have *not* tested aio23 with
io_uring; this is just about aioworkers (the future default).
Random notes and thoughts:
1. not a single crash was observed, but those were pretty short runs
2. thoughts from my (admittedly time-limited) data analysis:
- most of the time, perf with aioworkers is identical (+/- 3%) to
master; in many cases it is much BETTER
- boosts of up to ~2.01x can be spotted even on low-end hardware like
this with fast I/O, even without io_uring (just workers)
- on seqscans on "sata" with datasets bigger than the VFS cache
("big") and without parallel workers, it looks like it's always better
- on parallel seqscans on "sata" with datasets bigger than the VFS
cache ("big") and high e_io_c with high client counts (sigh!), it
looks like there would be a big, user-noticeable regression, but to me
it's not a regression as such; we are probably issuing way too many
posix_fadvise() readaheads with diminishing returns. Just letting you
know. Not sure it is worth introducing some global limiter (a shared
aioworkers e_io_c limit); I think not. It could also have been some
maintenance noise on that I/O device, but I have no isolated SATA
RAID10 with like 8x HDDs at home to launch such a test and be
absolutely sure.
3. with aioworkers, it would be worth pointing out in the
documentation that `iotop` won't be good enough to show which PID is
doing I/O anymore. I often get questions like: who is taking most of
the I/O right now, because storage is fully saturated on a multi-use
system? Not sure whether that would require a new view or not (the
pg_aios output seems to be more of an in-memory debug view that would
have to be sampled aggressively, and pg_statio_all_tables shows the
table, but not the PID -- same for pg_stat_io). IMHO, if the docs
simply said "In order to understand which processes (PIDs) are issuing
lots of IOs, please check pg_stat_activity for *IO/AioCompletion* wait
events", that would be good enough for a start.
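A sampling approach along those lines might look like the sketch below
(my own illustration; the exact wait_event names under AIO are an
assumption and may differ in the final patchset):

```sql
-- Sample which backends are currently in IO-related waits; run this
-- repeatedly (e.g. once a second) to approximate which PIDs drive IO.
SELECT pid, backend_type, wait_event, state,
       left(query, 40) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
ORDER BY pid;
```

Aggregating such samples over a minute or so would give a rough per-PID
IO ranking without needing a new view.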
Bench machine: intentionally much smaller hardware. Azure Lsv2 L8s_v2
(1st-gen EPYC/1s4c8t, with kernel 6.10.11+bpo-cloud-amd64 and booted
with mem=12GB, which limited real usable RAM to just ~8GB to stress
I/O). liburing 2.9. Normal standard compile options were used, without
asserts (such as normal users would use). The bench had these two I/O
storage devices (with XFS) attached:
- "sata" stands for Azure's "Premium SSD LRS" mounted on /sata
(Size=255GB, Max IOPS=1100 (@ 4kB?), Max throughput=125MB/s)
- "nvme" stands for the built-in NVMe on that VM mounted on /nvme
(Size=1788GB, Max IOPS=8000 (@ 4kB?))
I'll try to see in the coming weeks whether dedicating more time is
possible (long-run tests, more write tests, maybe some basic I/O
failure-injection tests).
-J.
* = 8640 test runs, always with a restart and a VFS cache flush; took
probably 2-3 days? I had to reduce tries to 1 and limit myself to just
reads to get it running solid before I left, so as not to miss the
plane :^)
Attachments:
aio23_potential_parallel_seqscan_regression.png
��n�z��A�9�����4�����(%�5��Eq8(>}I���;p���Q�
?W�W�-�;���+�����������
��(/��v���������� �3�*���6fP�~����dx���i�Y����G$���Y�g�?�k���Cm�����\��s7���n�������������q���^Q������^V�<���w@Y���k�/����g7K��r^��?^�����������O�}�3<��##�qE�][��g��2�zg�x\���H_���s��\i�~�Sg��q��l��W�<����qp�y}����;��X�o.���;�����+C�_�_��~c=��x�Uka���C��C�{�_��;�j������d���u�eZ��]�vm]�q����/9��3VY��b��\p/7+�<|�\}�e;ppK?�~~v-W^Q������g��e��B������N�Y�
����F���w�\h��j���
��Y�@�����l�8�e�W8��g2�t��=��*.��_���>�W�7�n.+�oe��-�^��}����W���I1��Dv�O�xd�N_��H R���*�t3�����@���Q?�)#��� ���Pb� �'r��?�f �#��;�[�B�4>��������[^$�k0E`<Kv<@j�EPu��
�'���I���A�LC����~�J��E'�$7f��=s���z�&�=�������O�e|��W���7r���� �p���i-���2N[um4���q��W�7r�b|�y�P�'��]�!�x��*��K�����������Z�������������5��� �q���9m�
�U���8�>m����69�]<
��g���;.��.��6����^������K����+��wf*��+����/��U@apr6��x=��Wqe��^����>�\2��m�����2�^/k�����q�r�tu�k�����V�}��8�'�3P5�v�*:m�!��][�|dU�����G���+���vm=���`�Vxxc��Ba����1����
��k��d����������>W�������~?{����
^q��a��,�/���l|�#���M\g��g(�������������[���}������%fOw���kT�}x�N�}n<��d���|L� ��2���<�t)����y�Afa��� �c�>�.�1������=yq��������'�@ �����#`���T��>I�3�H E&�36���v��g�l����`*����~�s1w���6�ab�^zfu�n.�R�9r� W_Q����u�~��3wYgW��]Gp�V7��gC���E����������*�w�����O���u��z�5����y�*��U������V�]�/6�wg�N��������x�*�&_���=�lwy��/g��S3}�7Z�L���u��35��o����i���������%�u�n�Z~d�#U.7iD���3��bc�Z4�'�P���Q�8\�d:��
����!�\���n�����Z�`�������H���^�$�3��;�s{�'���<�5��@�d�M��1s��`�k��y=7.\���#�x����p��,�\�
�2��������qm���~�.�e��C�u?��k���>�+{�����8�S;������,r�Yo��?+�Sa#��X�G���|i���s�n���j�.���Itsj�;W�Y��Mj��j���vy4"f|�x����t�5��7���1������^�~S�?������0�qQa@W����}1j���w�����#���e�{#7�\���W����]���:�n��w���{�b)���s���5����7��V��]�������Z����YT����h���k����z�����j��������r���v�����B�����.�~.'���~����O�zJg���
MW��Y\���1#]�g��|����f+���s��xK9;��h]M�;���'E<'����q��������c<